Search Results: "Vincent Bernat"

16 November 2014

Vincent Bernat: Staging a Netfilter ruleset in a network namespace

A common way to build a firewall ruleset is to run a shell script calling iptables and ip6tables. This is convenient since you get access to variables and loops. There are three major drawbacks with this method:

While the script is running, the firewall is temporarily incomplete. Even if existing connections can be arranged to be left untouched, the new ones may not be allowed to be established (or unauthorized flows may be allowed). Also, essential NAT rules or mangling rules may be absent.
If an error occurs, you are left with an half-working firewall. Therefore, you should ensure that some rules authorizing remote access are set very early. Or implement some kind of automatic rollback system.
Building a large firewall can be slow. Each ip ,6 tables command will download the ruleset from the kernel, add the rule and upload the whole modified ruleset to the kernel.

Using iptables-restore A classic way to solve these problems is to build a rule file that will be read by iptables-restore and ip6tables-restore¹. Those tools send the ruleset to the kernel in one pass. The kernel applies it atomically. Usually, such a file is built with ip ,6 tables-save but a script can fit the task. The ruleset syntax understood by ip ,6 tables-restore is similar to the syntax of ip ,6 tables but each table has its own block and chain declaration is different. See the following example:

$ iptables -P FORWARD DROP
$ iptables -t nat -A POSTROUTING -s 192.168.0.0/24 -j MASQUERADE
$ iptables -N SSH
$ iptables -A SSH -p tcp --dport ssh -j ACCEPT
$ iptables -A INPUT -i lo -j ACCEPT
$ iptables -A OUTPUT -o lo -j ACCEPT
$ iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
$ iptables -A FORWARD -j SSH
$ iptables-save
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
-A POSTROUTING -s 192.168.0.0/24 -j MASQUERADE
COMMIT
*filter
:INPUT ACCEPT [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
:SSH - [0:0]
-A INPUT -i lo -j ACCEPT
-A FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -j SSH
-A OUTPUT -o lo -j ACCEPT
-A SSH -p tcp -m tcp --dport 22 -j ACCEPT
COMMIT

As you see, we have one block for the nat table and one block for the filter table. The user-defined chain SSH is declared at the top of the filter block with other builtin chains. Here is a script diverting ip ,6 tables commands to build such a file (heavily relying on some Zsh-fu²):

#!/bin/zsh
set -e
work=$(mktemp -d)
trap "rm -rf $work" EXIT
#   Redefine ip ,6 tables
iptables()  
    # Intercept -t
    local table="filter"
    [[ -n $ @[(r)-t]  ]] &&  
        # Which table?
        local index=$ (k)@[(r)-t] 
        table=$ @[(( index + 1 ))] 
        argv=( $argv[1,(( $index - 1 ))] $argv[(( $index + 2 )),$#] )
     
    [[ -n $ @[(r)-N]  ]] &&  
        # New user chain
        local index=$ (k)@[(r)-N] 
        local chain=$ @[(( index + 1 ))] 
        print ":$ chain  -" >> $ work /$ 0 -$ table -userchains
        return
     
    [[ -n $ @[(r)-P]  ]] &&  
        # Policy for a builtin chain
        local index=$ (k)@[(r)-P] 
        local chain=$ @[(( index + 1 ))] 
        local policy=$ @[(( index + 2 ))] 
        print ":$ chain  $ policy " >> $ work /$ 0 -$ table -policy
        return
     
    # iptables-restore only handle double quotes
    echo $ $ (q-)@ //\'/\"  >> $ work /$ 0 -$ table -rules #'
 
functions[ip6tables]=$ functions[iptables] 
#   Build the final ruleset that can be parsed by ip ,6 tables-restore
save()  
    for table ($ work /$ 1 -*-rules(:t:s/-rules//))  
        print "*$ $ table #$ 1 - "
        [ ! -f $ work /$ table -policy ]   cat $ work /$ table -policy
        [ ! -f $ work /$ table -userchains   cat $ work /$ table -userchains
        cat $ work /$ table -rules
        print "COMMIT"
     
 
#   Execute rule files
for rule in $(run-parts --list --regex '^[.a-zA-Z0-9_-]+$' $ 0%/* /rules); do
    . $rule
done
#   Execute rule files
ret=0
save iptables    iptables-restore    ret=$?
save ip6tables   ip6tables-restore   ret=$?
exit $ret

In , a new iptables() function is defined and will shadow the iptables command. It will try to locate the -t parameter to know which table should be used. If such a parameter exists, the table is remembered in the $table variable and removed from the list of arguments. Defining a new chain (with -N) is also handled as well as setting the policy (with -P). In , the save() function will output a ruleset that should be parseable by ip ,6 tables-restore. In , user rules are executed. Each ip ,6 tables command will call the previously defined function. When no error has occurred, in , ip ,6 tables-restore is invoked. The command will either succeed or fail. This method works just fine³. However, the second method is more elegant.

Using a network namespace An hybrid approach is to build the firewall rules with ip ,6 tables in a newly created network namespace, save it with ip ,6 tables-save and apply it in the main namespace with ip ,6 tables-restore. Here is the gist (still using Zsh syntax):

#!/bin/zsh
set -e
alias main='/bin/true  '
[ -n $iptables ]    
    #   Execute ourself in a dedicated network namespace
    iptables=1 unshare --net -- \
        $0 4> >(iptables-restore) 6> >(ip6tables-restore)
    #   In main namespace, disable iptables/ip6tables commands
    alias iptables=/bin/true
    alias ip6tables=/bin/true
    alias main='/bin/false  '
 
#   In both namespaces, execute rule files
for rule in $(run-parts --list --regex '^[.a-zA-Z0-9_-]+$' $ 0%/* /rules); do
    . $rule
done
#   In test namespace, save the rules
[ -z $iptables ]    
    iptables-save >&4
    ip6tables-save >&6

In , the current script is executed in a new network namespace. Such a namespace has its own ruleset that can be modified without altering the one in the main namespace. The $iptables environment variable tell in which namespace we are. In the new namespace, we execute all the rule files ( ). They contain classic ip ,6 tables commands. If an error occurs, we stop here and nothing happens, thanks to the use of set -e. Otherwise, in , the ruleset of the new namespace are saved using ip ,6 tables-save and sent to dedicated file descriptors. Now, the execution in the main namespace resumes in . The results of ip ,6 tables-save are feeded to ip ,6 tables-restore. At this point, the firewall is mostly operational. However, we will play again the rule files ( ) but the ip ,6 tables commands will be disabled ( ). Additional commands in the rule files, like enabling IP forwarding, will be executed. The new namespace does not provide the same environment as the main namespace. For example, there is no network interface in it, so we cannot get or set IP addresses. A command that must not be executed in the new namespace should be prefixed by main:

main ip addr add 192.168.15.1/24 dev lan-guest

You can look at a complete example on GitHub.

Another nifty tool is iptables-apply which will apply a rule file and rollback after a given timeout unless the change is confirmed by the user.
As you can see in the snippet, Zsh comes with some powerful features to handle arrays. Another big advantage of Zsh is it does not require quoting every variable to avoid field splitting. Hence, the script can handle values with spaces without a problem, making it far more robust.
If I were nitpicking, there are three small flaws with it. First, when an error occurs, it can be difficult to match the appropriate location in your script since you get the position in the ruleset instead. Second, a table can be used before it is defined. So, it may be difficult to spot some copy/paste errors. Third, the IPv4 firewall may fail while the IPv6 firewall is applied, and vice-versa. Those flaws are not present in the next method.

Vincent Bernat: Intel Wireless 7260 as an access point

My home router acts as an access point with an Intel Dual-Band Wireless-AC 7260 wireless card. This card supports 802.11ac (on the 5 GHz band) and 802.11n (on both the 5 GHz and 2.4 GHz band). While this seems a very decent card to use in managed mode, this is not really a great choice for an access point.

$ lspci -k -nn -d 8086:08b1
03:00.0 Network controller [0280]: Intel Corporation Wireless 7260 [8086:08b1] (rev 73)
        Subsystem: Intel Corporation Dual Band Wireless-AC 7260 [8086:4070]
        Kernel driver in use: iwlwifi

TL;DR: Use an Atheros card instead.

Limitations First, the card is said dual-band but you can only uses one band at a time because there is only one radio. Almost all wireless cards have this limitation. If you want to use both the 2.4 GHz band and the less crowded 5 GHz band, two cards are usually needed.

5 GHz band There is no support to set an access point on the 5 GHz band. The firmware doesn t allow it. This can be checked with iw:

$ iw reg get
country CH: DFS-ETSI
        (2402 - 2482 @ 40), (N/A, 20), (N/A)
        (5170 - 5250 @ 80), (N/A, 20), (N/A)
        (5250 - 5330 @ 80), (N/A, 20), (0 ms), DFS
        (5490 - 5710 @ 80), (N/A, 27), (0 ms), DFS
        (57240 - 65880 @ 2160), (N/A, 40), (N/A), NO-OUTDOOR
$ iw list
Wiphy phy0
[...]
        Band 2:
                Capabilities: 0x11e2
                        HT20/HT40
                        Static SM Power Save
                        RX HT20 SGI
                        RX HT40 SGI
                        TX STBC
                        RX STBC 1-stream
                        Max AMSDU length: 3839 bytes
                        DSSS/CCK HT40
                Frequencies:
                        * 5180 MHz [36] (20.0 dBm) (no IR)
                        * 5200 MHz [40] (20.0 dBm) (no IR)
                        * 5220 MHz [44] (20.0 dBm) (no IR)
                        * 5240 MHz [48] (20.0 dBm) (no IR)
                        * 5260 MHz [52] (20.0 dBm) (no IR, radar detection)
                          DFS state: usable (for 192 sec)
                          DFS CAC time: 60000 ms
                        * 5280 MHz [56] (20.0 dBm) (no IR, radar detection)
                          DFS state: usable (for 192 sec)
                          DFS CAC time: 60000 ms
[...]

While the 5 GHz band is allowed by the CRDA, all frequencies are marked with no IR. Here is the explanation for this flag:

The no-ir flag exists to allow regulatory domain definitions to disallow a device from initiating radiation of any kind and that includes using beacons, so for example AP/IBSS/Mesh/GO interfaces would not be able to initiate communication on these channels unless the channel does not have this flag.

Multiple SSID This card can only advertise one SSID. Managing several of them is useful to setup distinct wireless networks, like a public access (routed to Tor), a guest access and a private access. iw can confirm this:

$ iw list
        valid interface combinations:
                 * #  managed   <= 1, #  AP, P2P-client, P2P-GO   <= 1, #  P2P-device   <= 1,
                   total <= 3, #channels <= 1

Here is the output of an Atheros card able to manage 8 SSID:

$ iw list
        valid interface combinations:
                 * #  managed, WDS, P2P-client   <= 2048, #  IBSS, AP, mesg point, P2P-GO   <= 8,
                   total <= 2048, #channels <= 1

Configuration as an access point Except for those two limitations, the card works fine as an access point. Here is the configuration that I use for hostapd:

interface=wlan-guest
driver=nl80211
# Radio
ssid=XXXXXXXXX
hw_mode=g
channel=11
# 802.11n
wmm_enabled=1
ieee80211n=1
ht_capab=[HT40-][SHORT-GI-20][SHORT-GI-40][DSSS_CCK-40][DSSS_CCK-40][DSSS_CCK-40]
# WPA
auth_algs=1
wpa=2
wpa_passphrase=XXXXXXXXXXXXXXX
wpa_key_mgmt=WPA-PSK
wpa_pairwise=TKIP
rsn_pairwise=CCMP

Because of the use of channel 11, only 802.11n HT40- rate can be enabled. Look at the Wikipedia page for 802.11n to check if you can use either HT40-, HT40+ or both.

Vincent Bernat: Replacing Swisscom router by a Linux box

I have recently moved to Lausanne, Switzerland. Broadband Internet access is not as cheap as in France. Free, a French ISP, is providing an FTTH access with a bandwith of 1 Gbps¹ for about 38 (including TV and phone service), Swisscom is providing roughly the same service for about 200 ². Swisscom fiber access was available for my appartment and I chose the 40 Mbps contract without phone service for about 80 . Like many ISP, Swisscom provides an Internet box with an additional box for TV. I didn t unpack the TV box as I have no use for it. The Internet box comes with some nice features like the ability to setup firewall rules, a guest wireless access and some file sharing possibilities. No shell access! I have bought a small PC to act as router and replace the Internet box. I have loaded the upcoming Debian Jessie on it. You can find the whole software configuration in a GitHub repository. This blog post only covers the Swisscom-specific setup (and QoS). Have a look at those two blog posts for related topics:

Ethernet The Internet box is packed with a Siligence-branded 1000BX SFP³. This SFP receives and transmits data on the same fiber using a different wavelength for each direction. Instead of using a network card with an SFP port, I bought a Netgear GS110TP which comes with 8 gigabit copper ports and 2 fiber SFP ports. It is a cheap switch bundled with many interesting features like VLAN and LLDP. It works fine if you don t expect too much from it.

IPv4 IPv4 connectivity is provided over VLAN 10. A DHCP client is mandatory. Moreover, the DHCP vendor class identifier option (option 60) needs to be advertised. This can be done by adding the following line to /etc/dhcp/dhclient.conf when using the ISC DHCP client:

send vendor-class-identifier "100008,0001,,Debian";

The first two numbers are here to identify the service you are requesting. I suppose this can be read as requesting the Swisscom residential access service. You can put whatever you want after that. Once you get a lease, you need to use a browser to identify yourself to Swisscom on the first use.

IPv6 Swisscom provides IPv6 access through the 6rd protocol. This is a tunneling mechanism to facilitate IPv6 deployment accross an IPv4 infrastructure. This kind of tunnel is natively supported by Linux since kernel version 2.6.33. To setup IPv6, you need the base IPv6 prefix and the 6rd gateway. Some ISP are providing those values through DHCP (option 212) but this is not the case for Swisscom. The gateway is 6rd.swisscom.com and the prefix is 2a02:1200::/28. After appending the IPv4 address to the prefix, you still get 4 bits for internal subnets. Swisscom doesn t provide a fixed IPv4 address. Therefore, it is not possible to precompute the IPv6 prefix. When installed as a DHCP hook (in /etc/dhcp/dhclient-exit-hooks.d/6rd), the following script configures the tunnel:

sixrd_iface=internet6
sixrd_mtu=1472                  # This is 1500 - 20 - 8 (PPPoE header)
sixrd_ttl=64
sixrd_prefix=2a02:1200::/28     # No way to guess, just have to know it.
sixrd_br=193.5.29.1             # That's "6rd.swisscom.com"
sixrd_down()  
    ip tunnel del $ sixrd_iface    true
 
sixrd_up()  
    ipv4=$ new_ip_address:-$old_ip_address 
    sixrd_subnet=$(ruby <<EOF
require 'ipaddr'
prefix = IPAddr.new "$ sixrd_prefix ", Socket::AF_INET6
prefixlen = $ sixrd_prefix#*/ 
ipv4 = IPAddr.new "$ ipv4 ", Socket::AF_INET
ipv6 = IPAddr.new (prefix.to_i + (ipv4.to_i << (64 + 32 - prefixlen))), Socket::AF_INET6
puts ipv6
EOF
)
    # Let's configure the tunnel
    ip tunnel add $ sixrd_iface  mode sit local $ipv4 ttl $sixrd_ttl
    ip tunnel 6rd dev $ sixrd_iface  6rd-prefix $ sixrd_prefix 
    ip addr add $ sixrd_subnet 1/64 dev $ sixrd_iface 
    ip link set mtu $ sixrd_mtu  dev $ sixrd_iface 
    ip link set $ sixrd_iface  up
    ip route add default via ::$ sixrd_br  dev $ sixrd_iface 
 
case $reason in
    BOUND REBOOT)
        sixrd_down
        sixrd_up
        ;;
    RENEW REBIND)
        if [ "$new_ip_address" != "$old_ip_address" ]; then
            sixrd_down
            sixrd_up
        fi
        ;;
    STOP EXPIRE FAIL RELEASE)
        sixrd_down
        ;;
esac

The computation of the IPv6 prefix is offloaded to Ruby instead of trying to use the shell for that. Even if the ipaddr module is pretty basic , it suits the job. Swisscom is using the same MTU for all clients. Because some of them are using PPPoE, the MTU is 1472 instead of 1480. You can easily check your MTU with this handy online MTU test tool. It is not uncommon that PMTUD is broken on some parts of the Internet. While not ideal, setting up TCP MSS will alievate any problem you may run into with a MTU less than 1500:

ip6tables -t mangle -A POSTROUTING -o internet6 \
          -p tcp --tcp-flags SYN,RST SYN \
          -j TCPMSS --clamp-mss-to-pmtu

QoS UPDATED: Unfortunately, this section is incorrect, including its premise. Have a look at Dave Taht comment for more details. Once upon a time, QoS was a tacky subject. The Wonder Shaper was a common way to get a somewhat working setup. Nowadays, thanks to the work of the Bufferbloat project, there are two simple steps to get something quite good:

~~Reduce the queue of your devices to something like 32 packets. This helps TCP to detect congestion and act accordingly while still being able to saturate a gigabit link.~~
```
ip link set txqueuelen 32 dev lan
ip link set txqueuelen 32 dev internet
ip link set txqueuelen 32 dev wlan
```
Change the root qdisc to fq_codel. A qdisc receives packets to be sent from the kernel and decide how packets are handled to the network card. Packets can be dropped, reordered or rate-limited. fq_codel is a queuing discipline combining fair queuing and controlled delay. Fair queuing means that all flows get an equal chance to be served. Another way to tell it is that a high-bandwidth flow won t starve the queue. Controlled delay means that the queue size will be limited to ensure the latency stays low. This is achieved by dropping packets more aggressively when the queue grows.
```
tc qdisc replace dev lan root fq_codel
tc qdisc replace dev internet root fq_codel
tc qdisc replace dev wlan root fq_codel
```

Maximum download speed is 1 Gbps, while maximum upload speed is 200 Mbps.
This is the standard Vivo XL package rated at CHF 169. plus the 1 Gbps option at CHF 80. .
There are two references on it: SGA 441SFP0-1Gb and OST-1000BX-S34-10DI. It transmits to the 1310 nm wave length and receives on the 1490 nm one.

1 August 2014

Raphaël Hertzog: My Free Software Activities in July 2014

This is my monthly summary of my free software related activities. If you re among the people who made a donation to support my work (548.59 , thanks everybody!), then you can learn how I spent your money. Otherwise it s just an interesting status update on my various projects. Distro Tracker Now that tracker.debian.org is live, people reported bugs (on the new tracker.debian.org pseudo-package that I requested) faster than I could fix them. Still I spent many, many hours on this project, reviewing submitted patches (thanks to Christophe Siraut, Joseph Herlant, Dimitri John Ledkov, Vincent Bernat, James McCoy, Andrew Starr-Bochicchio who all submitted some patches!), fixing bugs, making sure the code works with Django 1.7, and started the same with Python 3. I added a tox.ini so that I can easily run the test suite in all 4 supported environments (created by tox as virtualenv with the combinations of Django 1.6/1.7 and Python 2.7/3.4). Over the month, the git repository has seen 73 commits, we fixed 16 bugs and other issues that were only reported over IRC in #debian-qa. With the help of Enrico Zini and Martin Zobel, we enabled the possibility to login via sso.debian.org (Debian s official SSO) so that Debian developers don t even have to explicitly create their account. As usual more help is needed and I ll gladly answer your questions and review your patches. Misc packaging work Publican. I pushed a new upstream release of publican and dropped a useless build-dependency that was plagued by a difficult to fix RC bug (#749357 for the curious, I tried to investigate but it needs major work for make 4.x compatibility). GNOME 3.12. With gnome-shell 3.12 hitting unstable, I had to update gnome-shell-timer (and filed an upstream ticket at the same time), a GNOME Shell extension to start some run-down counters. Django 1.7. I packaged python-django 1.7 release candidate 1 in experimental (found a small bug, submitted a ticket with a patch that got quickly merged) and filed 85 bugs against all the reverse dependencies to ask their maintainers to test their package with Django 1.7 (that we want to upload before the freeze obviously). We identified a pain point in upgrade for packages using South and tried to discuss it with upstream, but after closer investigation, none of the packages are really affected. But the problem can hit administrators of non-packaged Django applications. Misc stuff. I filed a few bugs (#754282 against git-import-orig uscan, #756319 against wnpp to see if someone would be willing to package loomio), reviewed an updated package for django-ratelimit in #755611, made a non-maintainer upload of mairix (without prior notice) to update the package to a new upstream release and bring it to modern packaging norms (Mako failed to make an upload in 4 years so I just went ahead and did what I would have done if it were mine). Kali work resulting in Debian contributions Kali wants to switch from being based on stable to being based on testing so I did try to setup britney to manage a new kali-rolling repository and encountered some problems that I reported to debian-release. Niels Thykier has been very helpful and even managed to improve britney thanks to the very specific problem that the kali setup triggered. Since we use reprepro, I did write some Python wrapper to transform the HeidiResult file in a set of reprepro commands but at the same time I filed #756399 to request proper support of heidi files in reprepro. While analyzing britney s excuses file, I also noticed that the Kali mirrors contains many source packages that are useless because they only concern architectures that we don t host (and I filed #756523 against reprepro). While trying to build a live image of kali-rolling, I noticed that libdb5.1 and db5.1-util were still marked as priority standard when in fact Debian already switched to db5.3 and thus should only be optional (I filed #756623 against ftp.debian.org). When doing some upgrade tests from kali (wheezy based) to kali-rolling (jessie based) I noticed some problems that were also affecting Debian Jessie. I filed #756629 against libfile-fcntllock-perl (with a patch), and also #756618 against texlive-base (missing Replaces header). I also pinged Colin Watson on #734946 because I got a spurious base-passwd prompt during upgrade (that was triggered because schroot copied my unstable s /etc/passwd file in the kali chroot and the package noticed a difference on the shell of all system users). Thanks See you next month for a new summary of my activities.

One comment Liked this article? Click here. My blog is Flattr-enabled.

5 May 2014

Vincent Bernat: Dashkiosk: manage dashboards on multiple displays

Dashkiosk is a solution to manage dashboards on multiple displays. It comes in four parts:

A server will manage the screens by sending URL to be displayed. A web interface enables an administrator to configure groups of dashboards and attach them to a set of displays.
A receiver runs in a browser attached to each screen. On start, it contacts the server and waits for the URL to display.
An Android application provides a simple fullscreen webview to display the receiver.
A Chromecast custom receiver which will run the regular receiver to display dashboards using Google Chromecast devices. The server is able to drive Chromecast devices through nodecastor, a reimplementation of the sender API.

For a demo, have a look at the following video (it is also available as an Ogg Theora video).

27 April 2014

Vincent Bernat: Local corporate APT repositories

Distributing software efficiently accross your platform can be difficult. Every distribution comes with a package manager which is usually suited for this task. APT can be relied upon on when using Debian or a derivative. Unfortunately, the official repositories may not contain everything you need. When you require unpackaged software or more recent versions, it is possible to setup your own local repository. Most of what is presented here was setup for Dailymotion and was greatly inspired by the work done by Rapha l Pinson at Orange.

Setting up your repositories There are three kinds of repositories you may want to setup:

A distribution mirror. Such a mirror will save bandwidth, provide faster downloads and permanent access, even when someone searches Google on Google.
A local repository for your own packages with the ability to have a staging zone to test packages on some servers before putting them in production.
Mirrors for unofficial repositories, like Ubuntu PPA. To avoid unexpected changes, such a repository will also get a staging and a production zone.

Before going further, it is quite important to understand what a repository is. Let s illustrate with the following line from my /etc/apt/sources.list:

deb http://ftp.debian.org/debian/ unstable main contrib non-free

In this example, http://ftp.debian.org/debian/ is the repository and unstable is the distribution. A distribution is subdivided into components. We have three components: main, contrib and non-free. To setup repositories, we will use reprepro. This is not the only solution but it has a good balance between versatility and simplicity. reprepro can only handle one repository. So, the first choice is about how you will split your packages in repositories, distributions and components. Here is what matters:

A repository cannot contain two identical packages (same name, same version, same architecture).
Inside a component, you can only have one version of a package.
Usually, a distribution is a subset of the versions while a component is a subset of the packages. For example, in Debian, with the distribution unstable, you choose to get the most recent versions while with the component main, you choose to get DFSG-free software only.

If you go for several repositories, you will have to handle several reprepro instances and won t be able to easily copy packages from one place to another. At Dailymotion, we put everything in the same repository but it would also be perfectly valid to have three repositories:

one to mirror the distribution,
one for your local packages, and
one to mirror unofficial repositories.

Here is our target setup: Local APT repository

Initial setup First, create a system user to work with the repositories:

$ adduser --system --disabled-password --disabled-login \
>         --home /srv/packages \
>         --group reprepro

All operations should be done with this user only. If you want to setup several repositories, create a directory for each of them. Each repository has those subdirectories:

conf/ contains the configuration files,
gpg/ contains the GPG stuff to sign the repository¹,
logs/ contains the logs,
www/ contains the repository that should be exported by the web server.

Here is the content of conf/options:

outdir +b/www
logdir +b/logs
gnupghome +b/gpg

Then, you need to create the GPG key to sign the repository:

$ GNUPGHOME=gpg gpg --gen-key
Please select what kind of key you want:
   (1) RSA and RSA (default)
   (2) DSA and Elgamal
   (3) DSA (sign only)
   (4) RSA (sign only)
Your selection? 1
RSA keys may be between 1024 and 4096 bits long.
What keysize do you want? (2048) 4096
Requested keysize is 4096 bits
Please specify how long the key should be valid.
         0 = key does not expire
      <n>  = key expires in n days
      <n>w = key expires in n weeks
      <n>m = key expires in n months
      <n>y = key expires in n years
Key is valid for? (0) 10y
Key expires at mer. 08 nov. 2023 22:30:58 CET
Is this correct? (y/N) y
Real name: Dailymotion Archive Automatic Signing Key
Email address: the-it-operations@dailymotion.com
Comment: 
[...]

By setting an empty password, you allow reprepro to run unattended. You will have to distribute the public key of your new repository to let APT check the archive signature. An easy way is to ship it in some package.

Local mirror of an official distribution Let s start by mirroring a distribution. We want a local mirror of Ubuntu Precise. For this, we need to do two things:

Setup a new distribution in conf/distributions.
Configure the update sources in conf/updates.

Let s add this block to conf/distributions:

# Ubuntu Precise
Origin: Ubuntu
Label: Ubuntu
Suite: precise
Version: 12.04
Codename: precise
Architectures: i386 amd64
Components: main restricted universe multiverse
UDebComponents: main restricted universe multiverse
Description: Ubuntu Precise 12.04 (with updates and security)
Contents: .gz .bz2
UDebIndices: Packages Release . .gz
Tracking: minimal
Update: - ubuntu-precise ubuntu-precise-updates ubuntu-precise-security
SignWith: yes

This defines the precise distribution in our repository. It contains four components: main, restricted, universe and multiverse (like the regular distribution in official repositories). The Update line starts with a dash. This means reprepro will mark everything as deleted before updating with the provided sources. Old packages will not be kept when they are removed from Ubuntu. In conf/updates, we define the sources:

# Ubuntu Precise
Name: ubuntu-precise
Method: http://fr.archive.ubuntu.com/ubuntu
Fallback: http://de.archive.ubuntu.com/ubuntu
Suite: precise
Components: main main multiverse restricted universe
UDebComponents: main restricted universe multiverse
Architectures: amd64 i386
VerifyRelease: 437D05B5
GetInRelease: no
# Ubuntu Precise Updates
Name: ubuntu-precise-updates
Method: http://fr.archive.ubuntu.com/ubuntu
Fallback: http://de.archive.ubuntu.com/ubuntu
Suite: precise-updates
Components: main restricted universe multiverse
UDebComponents: main restricted universe multiverse
Architectures: amd64 i386
VerifyRelease: 437D05B5
GetInRelease: no
# Ubuntu Precise Security
Name: ubuntu-precise-security
Method: http://fr.archive.ubuntu.com/ubuntu
Fallback: http://de.archive.ubuntu.com/ubuntu
Suite: precise-security
Components: main restricted universe multiverse
UDebComponents: main restricted universe multiverse
Architectures: amd64 i386
VerifyRelease: 437D05B5
GetInRelease: no

The VerifyRelease lines are GPG key fingerprint to use to check the remote repository. The key needs to be imported in the local keyring:

$ gpg --keyring /usr/share/keyrings/ubuntu-archive-keyring.gpg \
>     --export 437D05B5   GNUPGHOME=gpg gpg --import

Another important point is that we merge three distributions (precise, precise-updates and precise-security) into a single distribution (precise) in our local repository. This may cause some difficulties with tools expecting the three distributions to be available (like the Debian Installer²). Next, you can run reprepro and ask it to update your local mirror:

$ reprepro update

This will take some time on the first run. You can execute this command every night. reprepro is not the fastest mirror solution but it is easy to setup, flexible and reliable.

Repository for local packages Let s configure the repository to accept local packages. For each official distribution (like precise), we will configure two distributions:

precise-staging contains packages that have not been fully tested and not ready to go to production.
precise-prod contains production packages copied from precise-staging.

In our workflow, packages are introduced in precise-staging where they can be tested and will be copied to precise-prod when we want them to be available for production. You can adopt a more complex workflow if you need. The reprepro part is quite easy. We add the following blocks into conf/distributions:

# Dailymotion Precise packages (staging)
Origin: Dailymotion #  
Label: dm-staging   #  
Suite: precise-staging
Codename: precise-staging
Architectures: i386 amd64 source
Components: main role/dns role/database role/web #  
Description: Dailymotion Precise staging repository
Contents: .gz .bz2
Tracking: keep
SignWith: yes
NotAutomatic: yes #  
Log: packages.dm-precise-staging.log
 --type=dsc email-changes
# Dailymotion Precise packages (prod)
Origin: Dailymotion #  
Label: dm-prod      #  
Suite: precise-prod
Codename: precise-prod
Architectures: i386 amd64 source
Components: main role/dns role/database role/web #  
Description: Dailymotion Precise prod repository
Contents: .gz .bz2
Tracking: keep
SignWith: yes
Log: packages.dm-precise-prod.log

First notice we use several components (in ):

main will contain packages that are not specific to a subset of the platform. If you put a package in main, it should work correctly on any host.
role/* are components dedicated to a subset of the platform. For example, in role/dns, we ship a custom version of BIND.

The staging distribution has the NotAutomatic flag (in ) which disallows the package manager to install those packages except if the user explicitely requests it. Just below, when a new dsc file is uploaded, the hook email-changes will be executed. It should be in the conf/ directory. The Origin and Label lines (in ) are quite important to be able to define an explicit policy of which packages should be installed. Let s say we use the following /etc/apt/sources.list file:

# Ubuntu packages
deb http://packages.dm.gg/dailymotion precise main restricted universe multiverse
# Dailymotion packages
deb http://packages.dm.gg/dailymotion precise-prod    main role/dns
deb http://packages.dm.gg/dailymotion precise-staging main role/dns

All servers have the precise-staging distribution. We must ensure we won t install those packages by mistake. The NotAutomatic flag is one possible safe-guard. We also use a tailored /etc/apt/preferences:

Explanation: Dailymotion packages of a specific component should be more preferred
Package: *
Pin: release o=Dailymotion, l=dm-prod, c=role/*
Pin-Priority: 950
Explanation: Dailymotion packages should be preferred
Package: *
Pin: release o=Dailymotion, l=dm-prod
Pin-Priority: 900
Explanation: staging should never be preferred
Package: *
Pin: release o=Dailymotion, l=dm-staging
Pin-Priority: -100

By default, packages will have a priority of 500. By setting a priority of -100 to the staging distribution, we ensure the packages cannot be installed at all. This is stronger than NotAutomatic which sets the priority to 1. When a package exists in Ubuntu and in our local repository, we ensure that, if this is a production package, we will use ours by using a priority of 900 (or 950 if we match a specific role component). Have a look at the How APT Interprets Priorities section of apt_preferences(5) manual page for additional information. Keep in mind that version matters only when the priority is the same. To check if everything works as you expect, use apt-cache policy:

$ apt-cache policy php5-memcache
  Installed: 3.0.8-1~precise2~dm1
  Candidate: 3.0.8-1~precise2~dm1
  Version table:
 *** 3.0.8-1~precise2~dm1 0
        950 http://packages.dm.gg/dailymotion/ precise-prod/role/web amd64 Packages
        100 /var/lib/dpkg/status
     3.0.8-1~precise1~dm4 0
        900 http://packages.dm.gg/dailymotion/ precise-prod/main amd64 Packages
       -100 http://packages.dm.gg/dailymotion/ precise-staging/main amd64 Packages
     3.0.6-1 0
        500 http://packages.dm.gg/dailymotion/ precise/universe amd64 Packages

If we want to install a package from the staging distribution, we can use apt-get with the -t precise-staging option to raise the priority of this distribution to 990. Once you have tested your package, you can copy it from the staging distribution to the production distribution:

$ reprepro -C main copysrc precise-prod precise-staging wackadoodle

Local mirror of third-party repositories Sometimes, you want a software published on some third-party repository without to repackage it yourself. A common example is the repositories edited by hardware vendors. Like for an Ubuntu mirror, there are two steps: defining the distribution and defining the source. We chose to put such mirrors into the same distributions as our local packages but with a dedicated component for each mirror. This way, those third-party packages will share the same workflow as our local packages: they will appear in the staging distribution, we validate them and copy them to the production distribution. The first step is to add the components and an appropriate Update line to conf/distributions:

Origin: Dailymotion
Label: dm-staging
Suite: precise-staging
Components: main role/dns role/database role/web vendor/hp
Update: hp
# [...]
Origin: Dailymotion
Label: dm-prod
Suite: precise-prod
Components: main role/dns role/database role/web vendor/hp
# [...]

We added the vendor/hp component to both the staging and the production distributions. However, only the staging distribution gets an Update line (remember, packages will be copied manually into the production distribution). We declare the source in conf/updates:

# HP repository
Name: hp
Method: http://downloads.linux.hp.com/SDR/downloads/ManagementComponentPack/
Suite: precise/current
Components: non-free>vendor/hp
Architectures: i386 amd64
VerifyRelease: 2689B887
GetInRelease: no

Don t forget to add the GPG key to your local keyring. Notice an interesting feature of reprepro: we copy the remote non-free component to our local vendor/hp component. Then, you can synchronize the mirror with reprepro update. Once the packages have been tested, you will have to copy them in the production distribution.

Building Debian packages Our reprepro setup seems complete, but how do we put packages into the staging distribution? You have several options to build Debian packages for your local repository. It really depends on how much time you want to invest in this activity:

Build packages from source by adding a `debian/` directory. This is the classic way of building Debian packages. You can start from scratch or use an existing package as a base. In the latest case, the package can be from the official archive but for a more recent distribution or a backport or from an unofficial repository.

Use a tool that will create a binary package from a directory, like fpm. Such a tool will try to guess a lot of things to minimize your work. It can even download everything for you.

There is no universal solution. If you don t have the time budget for building packages from source, have a look at fpm. I would advise you to use the first approach when possible because you will get those perks for free:

You keep the sources in your repository. Whenever you need to rebuild something to fix an emergency bug, you won t have to hunt the sources which may be unavailable when you need them the most. Of course, this only works if you build packages that don t download stuff directly from the Internet.

You also keep the recipe³ to build the package in your repository. If someone enables some option and rebuild the package, you won t accidently drop this option on the next build. Those changes can be documented in `debian/changelog`. Moreover, you can use a version control software for the whole `debian/` directory.

You can propose your package for inclusion into Debian. This will help many people once the package hits the archive.

Builders We chose pbuilder as a builder⁴. Its setup is quite straightforward. Here is our /etc/pbuilderrc:

DISTRIBUTION=$DIST
NAME="$DIST-$ARCH"
MIRRORSITE=http://packages.dm.gg/dailymotion
COMPONENTS=("main" "restricted" "universe" "multiverse")
OTHERMIRROR="deb http://packages.dm.gg/dailymotion $ DIST -staging main"
HOOKDIR=/etc/pbuilder/hooks.d
BASE=/var/cache/pbuilder/dailymotion
BASETGZ=$BASE/$NAME/base.tgz
BUILDRESULT=$BASE/$NAME/results/
APTCACHE=$BASE/$NAME/aptcache/
DEBBUILDOPTS="-sa"
KEYRING="/usr/share/keyrings/dailymotion-archive.keyring.gpg"
DEBOOTSTRAPOPTS=("--arch" "$ARCH" "--variant=buildd" "$ DEBOOTSTRAPOPTS[@] " "--keyring=$KEYRING")
APTKEYRINGS=("$KEYRING")
EXTRAPACKAGES=("dailymotion-archive-keyring")

pbuilder is expected to be invoked with DIST, ARCH and optionally ROLE environment variables. Building the initial bases can be done like this:

for ARCH in i386 amd64; do
  for DIST in precise; do
    export ARCH
    export DIST
    pbuilder --create
  done
done

We don t create a base for each role. Instead, we use a D hook to add the appropriate source:

#!/bin/bash
[ -z "$ROLE" ]    
  cat >> /etc/apt/sources.list <<EOF
deb http://packages.dm.gg/dailymotion $ DIST -staging role/$ ROLE 
EOF
 
apt-get update

We ensure packages from our staging distribution are preferred over other packages by adding an /etc/apt/preferences file in a E hook:

#!/bin/bash
cat > /etc/apt/preferences <<EOF
Explanation: Dailymotion packages are of higher priority
Package: *
Pin: release o=Dailymotion
Pin-Priority: 900
EOF

We also use a C hook to get a shell in case there is an error. This is convenient to debug a problem:

#!/bin/bash
apt-get install -y --force-yes vim less
cd /tmp/buildd/*/debian/..
/bin/bash < /dev/tty > /dev/tty 2> /dev/tty

A manual build can be run with:

$ ARCH=amd64 DIST=precise ROLE=web pbuilder \
>         --build somepackage.dsc

Version numbering To avoid to apply complex rules to chose a version number for a package, we chose to treat everything as a backport, even in-house software. We use the following scheme: `X-Y~preciseZ+dmW`.

`X` is the upstream version⁵.

`Y` is the Debian version. If there is no Debian version, use 0.

`Z` is the Ubuntu backport version. Again, if such a version doesn t exist, use 0.

`W` is our version of the package. We increment it when we make a change to the packaging. This is the only number we are allowed to control. All the others are set by an upstream entity, unless it doesn t exist and in this case, you use 0.

Let s suppose you need to backport `wackadoodle`. It is available in a more recent version of Ubuntu as `1.4-3`. Your first backport will be `1.4-3~precise0+dm1`. After a change to the packaging, the version will be `1.4-3~precise0+dm2`. A new upstream version `1.5` is available and you need it. You will use `1.5-0~precise0+dm1`. Later, this new upstream version will be available in some version of Ubuntu as `1.5-3ubuntu1`. You will rebase your changes on this version and get `1.5-3ubuntu1~precise0+dm1`. When using Debian instead of Ubuntu, a compatible convention could be : `X-Y~bpo70+Z~dm+W`.

Uploading To upload a package, a common setup is the following workflow:

Upload the source package to an incoming directory.
reprepro will notice the source package, check its correctness (signature, distribution) and put it in the archive.
The builder will notice a new package needs to be built and build it.
Once the package is built, the builder will upload the result to the incoming directory.
reprepro will notice again the new binary package and integrate it in the archive.

This workflow has the disadvantage to have many moving pieces and to leave the user in the dark while the compilation is in progress. As an alternative, a simple script can be used to execute each step synchronously. The user can follow on their terminal that everything works as expected. Once we have the .changes file, the build script just issues the appropriate command to include the result in the archive:

$ reprepro -C main include precise-staging \
>      wackadoodle_1.4-3~precise0+dm4_amd64.changes

Happy hacking!

The gpg/ directory could be shared by several repositories.
We teached Debian Installer to work with our setup with an appropriate preseed file.
fpm-cookery is a convenient tool to write recipes for fpm, similar to Homebrew or a BSD port tree. It could be used to achieve the same goal.
sbuild is an alternative to pbuilder and is the official builder for both Debian and Ubuntu. Historically, pbuilder was more focused on developers needs.
For a Git snapshot, we use something like 1.4-git20130905+1-ae42dc1 which is a snapshot made after version 1.4 (use 0.0 if no version has ever been released) at the given date. The following 1 is to be able to package different snapshots at the same date while the hash is here in case you need to retrieve the exact snapshot.

18 March 2014

Vincent Bernat: EDNS client subnet support for BIND

To provide geolocation-aware answers with BIND, a common solution is to use a patch adding GeoIP support. A client can be directed to the closest (and hopefully fastest) web server:

view "FRANCE"  
     match-clients   geoip_cityDB_country_FR;  ;
     zone "example.com" in  
         type master;
         file "france.example.com.dns";
      ;
 ;
view "GERMANY"  
     match-clients   geoip_cityDB_country_DE;  ;
     zone "example.com" in  
         type master;
         file "germany.example.com.dns";
      ;
 ;
/* [...] */
view "DEFAULT"  
    zone "example.com" in  
        type master;
        file "example.com.dns";
     ;
 ;

However, an end user does not usually talk directly to authoritative servers. They proxy the query to a third-party recursor server which will query the authoritative server on their behalf. The recursor also caches the answer to be able to serve it directly to other clients. On most cases, we can still rely on the recursor GeoIP location to forward the client to the closest web server because it is located in the client s ISP network, as shown on the following schema: Query for www.example.com through an ISP recursor

Query for www.example.com through an ISP recursor

Juan is located in China and wants to know the IP address of www.example.com. She queries her ISP resolver.
The resolver asks the authoritative server for the answer.
Because the IP address of the resolver is located in China, the authoritative server decides to answer with the IP address of the web server located in Japan which is the closest one.
Juan can now enjoy short round-trips with the web server.

However, this is not the case when using a public recursor as provided by Google or OpenDNS. In this case, the IP address of the end client and the source IP address of the recursor may not share the same locality. For example, in the following schema, the authoritative server now thinks it is in relation with an European customer and answers with the IP address of the web server located in Europe: Query for www.example.com through an open recursor

Query for www.example.com through an open recursor

Moreover, caching makes the problem worse. To solve this problem, a new EDNS extension to expose the client subnet has been proposed. When using this extension, the recursor will provide the client subnet to the authoritative server for it to build an optimized reply. The subnet is vague enough to respect client s privacy but precise enough to be able to locate it. A patched version of dig allows one to make queries with this new extension:

$ geoiplookup 138.231.136.0
GeoIP Country Edition: FR, France
$ ./bin/dig/dig @dns-02.dailymotion.com www.dailymotion.com \
>     +client=138.231.136.0/24
; <<>> DiG 9.8.1-P1-geoip-1.3 <<>> @dns-02.dailymotion.com www.dailymotion.com +client=138.231.136.0/24
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 23312
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; CLIENT-SUBNET: 138.231.136.0/24/24
;; QUESTION SECTION:
;www.dailymotion.com.           IN      A
;; ANSWER SECTION:
www.dailymotion.com.    600     IN      A       195.8.215.136
www.dailymotion.com.    600     IN      A       195.8.215.137
;; Query time: 20 msec
;; SERVER: 188.65.127.2#53(188.65.127.2)
;; WHEN: Sun Oct 20 15:44:47 2013
;; MSG SIZE  rcvd: 91
$ geoiplookup 195.8.215.136
GeoIP Country Edition: FR, France

In the above example, a client located in France gets a reply with two IP addresses located in France. If we now are an US client, we will get IP addresses located in the US:

$ geoiplookup 170.149.100.0
GeoIP Country Edition: US, United States
$ ./bin/dig/dig @dns-02.dailymotion.com www.dailymotion.com \
>     +client=170.149.100.0/24
; <<>> DiG 9.8.1-P1-geoip-1.3 <<>> @dns-02.dailymotion.com www.dailymotion.com +client=170.149.100.0/24
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 23187
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; CLIENT-SUBNET: 170.149.100.0/24/24
;; QUESTION SECTION:
;www.dailymotion.com.           IN      A
;; ANSWER SECTION:
www.dailymotion.com.    600     IN      A       188.65.120.135
www.dailymotion.com.    600     IN      A       188.65.120.136
;; Query time: 18 msec
;; SERVER: 188.65.127.2#53(188.65.127.2)
;; WHEN: Sun Oct 20 15:47:22 2013
;; MSG SIZE  rcvd: 91
$ geoiplookup 188.65.120.135
GeoIP Country Edition: US, United States

The recursor is expected to cache the two different answers and only serve them if the client matches the appropriate subnet (the one confirmed in the answer from the authoritative server). With this new extension, the authoritative server knows that Juan is located in China and answers with the appropriate IP address: Query for www.example.com through an open recursor with client subnet

Query for www.example.com through an open recursor with client subnet

Not many authoritative servers support this extension (PowerDNS and gdnsd, as far as I know). At Dailymotion, we have built a patch for BIND. It only works when BIND is configured as an authoritative server and it doesn t expose no configuration knobs. Feel free to use it (at your own risk). Once installed, you need to register yourself to OpenDNS and to Google to receive queries with the extension enabled.

24 February 2014

Vincent Bernat: Coping with the TCP TIME-WAIT state on busy Linux servers

TL;DR: Do not enable net.ipv4.tcp_tw_recycle. The Linux kernel documentation is not very helpful about what net.ipv4.tcp_tw_recycle does:

Enable fast recycling TIME-WAIT sockets. Default value is 0. It should not be changed without advice/request of technical experts.

Its sibling, net.ipv4.tcp_tw_reuse is a little bit more documented but the language is about the same:

Allow to reuse TIME-WAIT sockets for new connections when it is safe from protocol viewpoint. Default value is 0. It should not be changed without advice/request of technical experts.

The mere result of this lack of documentation is that we find numerous tuning guides advising to set both these settings to 1 to reduce the number of entries in the TIME-WAIT state. However, as stated by tcp(7) manual page, the net.ipv4.tcp_tw_recycle option is quite problematic for public-facing servers as it won t handle connections from two different computers behind the same NAT device, which is a problem hard to detect and waiting to bite you:

Enable fast recycling of TIME-WAIT sockets. Enabling this option is not recommended since this causes problems when working with NAT (Network Address Translation).

I will provide here a more detailed explanation in the hope to teach people who are wrong on the Internet. xkcd illustration

As a sidenote, despite the use of ipv4 in its name, the net.ipv4.tcp_tw_recycle control also applies to IPv6. Also, keep in mind we are looking at the TCP stack of Linux. This is completely unrelated to Netfilter connection tracking which may be tweaked in other ways¹.

About TIME-WAIT state Let s rewind a bit and have a close look at this TIME-WAIT state. What is it? See the TCP state diagram below²:

Only the end closing the connection first will reach the TIME-WAIT state. The other end will follow a path which usually permits to quickly get rid of the connection. You can have a look at the current state of connections with

ss
-tan

$ ss -tan   head -5
LISTEN     0  511             *:80              *:*     
SYN-RECV   0  0     192.0.2.145:80    203.0.113.5:35449
SYN-RECV   0  0     192.0.2.145:80   203.0.113.27:53599
ESTAB      0  0     192.0.2.145:80   203.0.113.27:33605
TIME-WAIT  0  0     192.0.2.145:80   203.0.113.47:50685

Purpose There are two purposes for the TIME-WAIT state:

The most known one is to prevent delayed segments from one connection being accepted by a later connection relying on the same quadruplet (source address, source port, destination address, destination port). The sequence number also needs to be in a certain range to be accepted. This narrows a bit the problem but it still exists, especially on fast connections with large receive windows. RFC 1337 explains in details what happens when the TIME-WAIT state is deficient³. Here is an example of what could be avoided if the TIME-WAIT state wasn t shortened:

Duplicate segments accepted in another connection

The other purpose is to ensure the remote end has closed the connection. When the last ACK is lost, the remote end stays in the LAST-ACK state⁴. Without the TIME-WAIT state, a connection could be reopened while the remote end still thinks the previous connection is valid. When it receives a SYN segment (and the sequence number matches), it will answer with a RST as it is not expecting such a segment. The new connection will be aborted with an error:

RFC 793 requires the TIME-WAIT state to last twice the time of the MSL. On Linux, this duration is not tunable and is defined in include/net/tcp.h as one minute:

#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT
                                  * state, about 60 seconds     */

There have been propositions to turn this into a tunable value but it has been refused on the ground the TIME-WAIT state is a good thing.

Problems Now, let s see why this state can be annoying on a server handling a lot of connections. There are three aspects of the problem:

the slot taken in the connection table preventing new connections of the same kind,

the memory occupied by the socket structure in the kernel, and

the additional CPU usage.

The result of `ss -tan state time-wait wc -l` is not a problem per se!

Connection table slot A connection in the TIME-WAIT state is kept for one minute in the connection table. This means, another connection with the same quadruplet (source address, source port, destination address, destination port) cannot exist. For a web server, the destination address and the destination port are likely to be constant. If your web server is behind a L7 load-balancer, the source address will also be constant. On Linux, the client port is by default allocated in a port range of about 30,000 ports (this can be changed by tuning net.ipv4.ip_local_port_range). This means that only 30,000 connections can be established between the web server and the load-balancer every minute, so about 500 connections per second. If the TIME-WAIT sockets are on the client side, such a situation is easy to detect. The call to connect() will return EADDRNOTAVAIL and the application will log some error message about that. On the server side, this is more complex as there is no log and no counter to rely on. In doubt, you should just try to come with something sensible to list the number of used quadruplets:

$ ss -tan 'sport = :80'   awk ' print $(NF)" "$(NF-1) '   \
>     sed 's/:[^ ]*//g'   sort   uniq -c
    696 10.24.2.30 10.33.1.64
   1881 10.24.2.30 10.33.1.65
   5314 10.24.2.30 10.33.1.66
   5293 10.24.2.30 10.33.1.67
   3387 10.24.2.30 10.33.1.68
   2663 10.24.2.30 10.33.1.69
   1129 10.24.2.30 10.33.1.70
  10536 10.24.2.30 10.33.1.73

The solution is more quadruplets⁵. This can be done in several ways (in the order of difficulty to setup):

use more client ports by setting net.ipv4.ip_local_port_range to a wider range,
use more server ports by asking the web server to listen to several additional ports (81, 82, 83, ),
use more client IP by configuring additional IP on the load balancer and use them in a round-robin fashion,
use more server IP by configuring additional IP on the web server⁶.

Of course, a last solution is to tweak net.ipv4.tcp_tw_reuse and net.ipv4.tcp_tw_recycle. Don t do that yet, we will cover those settings later.

Memory With many connections to handle, leaving a socket open for one additional minute may cost your server some memory. For example, if you want to handle about 10,000 new connections per second, you will have about 600,000 sockets in the TIME-WAIT state. How much memory does it represent? Not that much! First, from the application point of view, a TIME-WAIT socket does not consume any memory: the socket has been closed. In the kernel, a TIME-WAIT socket is present in three structures (for three different purposes):

A hash table of connections, named the TCP established hash table (despite containing connections in other states) is used to locate an existing connection, for example when receiving a new segment. Each bucket of this hash table contains both a list of connections in the TIME-WAIT state and a list of regular active connections. The size of the hash table depends on the system memory and is printed at boot:

$ dmesg   grep "TCP established hash table"
[    0.169348] TCP established hash table entries: 65536 (order: 8, 1048576 bytes)

It is possible to override it by specifying the number of entries on the kernel command line with the thash_entries parameter. Each element of the list of connections in the TIME-WAIT state is a struct tcp_timewait_sock, while the type for other states is struct tcp_sock⁷:

struct tcp_timewait_sock  
    struct inet_timewait_sock tw_sk;
    u32    tw_rcv_nxt;
    u32    tw_snd_nxt;
    u32    tw_rcv_wnd;
    u32    tw_ts_offset;
    u32    tw_ts_recent;
    long   tw_ts_recent_stamp;
 ;
struct inet_timewait_sock  
    struct sock_common  __tw_common;
    int                     tw_timeout;
    volatile unsigned char  tw_substate;
    unsigned char           tw_rcv_wscale;
    __be16 tw_sport;
    unsigned int tw_ipv6only     : 1,
                 tw_transparent  : 1,
                 tw_pad          : 6,
                 tw_tos          : 8,
                 tw_ipv6_offset  : 16;
    unsigned long            tw_ttd;
    struct inet_bind_bucket *tw_tb;
    struct hlist_node        tw_death_node;
 ;

A set of lists of connections, called the death row , is used to expire the connections in the TIME-WAIT state. They are ordered by how much time left before expiration. It uses the same memory space as for the entries in the hash table of connections. This is the struct hlist_node tw_death_node member of struct inet_timewait_sock.
A hash table of bound ports, holding the locally bound ports and the associated parameters, is used to determine if it is safe to listen to a given port or to find a free port in the case of dynamic bind. The size of this hash table is the same as the size of the hash table of connections:
```
$ dmesg   grep "TCP bind hash table"
[    0.169962] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
```
Each element is a struct inet_bind_socket. There is one element for each locally bound port. A TIME-WAIT connection to a web server is locally bound to the port 80 and shares the same entry as its sibling TIME-WAIT connections. On the other hand, a connection to a remote service is locally bound to some random port and does not share its entry.

So, we are only concerned by the space occupied by

struct
tcp_timewait_sock

and struct inet_bind_socket. There is one

struct
tcp_timewait_sock

for each connection in the TIME-WAIT state, inbound or outbound. There is one dedicated struct inet_bind_socket for each outbound connection and none for an inbound connection. A struct tcp_timewait_sock is only 168 bytes while a

struct
inet_bind_socket

is 48 bytes:

$ sudo apt-get install linux-image-$(uname -r)-dbg
[...]
$ gdb /usr/lib/debug/boot/vmlinux-$(uname -r)
(gdb) print sizeof(struct tcp_timewait_sock)
 $1 = 168
(gdb) print sizeof(struct tcp_sock)
 $2 = 1776
(gdb) print sizeof(struct inet_bind_bucket)
 $3 = 48

So, if you have about 40,000 inbound connections in the TIME-WAIT state, it should eat less than 10MB of memory. If you have about 40,000 outbound connections in the TIME-WAIT state, you need to account for 2.5MB of additional memory. Let s check that by looking at the output of slabtop. Here is the result on a server with about 50,000 connections in the TIME-WAIT state, 45,000 of which are outbound connections:

$ sudo slabtop -o   grep -E '(^  OBJS tw_sock_TCP tcp_bind_bucket)'
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
 50955  49725  97%    0.25K   3397       15     13588K tw_sock_TCP            
 44840  36556  81%    0.06K    760       59      3040K tcp_bind_bucket

There is nothing to change here: the memory used by TIME-WAIT connections is really small. If your server need to handle thousands of new connections per second, you need far more memory to be able to efficiently push data to clients. The overhead of TIME-WAIT connections is negligible.

CPU On the CPU side, searching for a free local port can be a bit expensive. The work is done by the `inet_csk_get_port()` function which uses a lock and iterate on locally bound ports until a free port is found. A large number of entries in this hash table is usually not a problem if you have a lot of outbound connections in the `TIME-WAIT` state (like ephemeral connections to a memcached server): the connections usually share the same profile, the function will quickly find a free port as it iterates on them sequentially.

Summary The universal solution is to increase the number of possible quadruplets by using, for example, more server ports. This will allow you to not exhaust the possible connections with `TIME-WAIT` entries. On the server side, do not enable `net.ipv4.tcp_tw_recycle` unless you are pretty sure you will never have NAT devices in the mix. Enabling `net.ipv4.tcp_tw_reuse` is useless for incoming connections. On the client side, enabling `net.ipv4.tcp_tw_reuse` is another almost-safe solution. Enabling `net.ipv4.tcp_tw_recycle` in addition to `net.ipv4.tcp_tw_reuse` is mostly useless. And a final quote by W. Richard Stevens, in Unix Network Programming:
The `TIME_WAIT` state is our friend and is there to help us (i.e., to let old duplicate segments expire in the network). Instead of trying to avoid the state, we should understand it.

Notably, fiddling with `net.netfilter.nf_conntrack_tcp_timeout_time_wait` won t change anything on how the TCP stack will handle the `TIME-WAIT` state.

This diagram is licensed under the LaTeX Project Public License 1.3. The original file is available on this page.

The first work-around proposed in RFC 1337 is to ignore RST segments in the `TIME-WAIT` state. This behaviour is controlled by `net.ipv4.rfc1337` which is not enabled by default on Linux because this is not a complete solution to the problem described in the RFC.

While in the `LAST-ACK` state, a connection will retransmit the last FIN segment until it gets the expected ACK segment. Therfore, it is unlikely we stay long in this state.

On the client side, older kernels also have to find a free local tuple (source address and source port) for each outgoing connection. Increasing the number of server ports or IP won t help in this case. Linux 3.2 is recent enough to be able to share the same local tuple for different destinations. Thanks to Willy Tarreau for his insight on this aspect.

This last solution may seem a bit dumb since you could just use more ports but some servers are not able to be configured this way. The before last solution can also be quite cumbersome to setup, depending on the load-balancing software, but uses less IP than the last solution.

The use of a dedicated memory structure for sockets in the `TIME-WAIT` is here since Linux 2.6.14. The `struct sock_common` structure is a bit more verbose and I won t copy it here.

When the server closes the connection first, it gets the `TIME-WAIT` state while the client will consider the corresponding quadruplet free and hence may reuse it for a new connection.

1 January 2014

Vincent Bernat: Testing infrastructure with serverspec

Checking if your servers are configured correctly can be done with IT automation tools like Puppet, Chef, Ansible or Salt. They allow an administrator to specify a target configuration and ensure it is applied. They can also run in a dry-run mode and report servers not matching the expected configuration. On the other hand, serverspec is a tool to bring the well known RSpec, a testing tool for the Ruby programming language frequently used for test-driven development, to the infrastructure world. It can be used to remotely test server state through an SSH connection. Why one would use such an additional tool? Many things are easier to express with a test than with a configuration change, like for example checking that a service is correctly installed by checking it is listening to some port.

Getting started Good knowledge of Ruby may help but is not a prerequisite to the use of serverspec. Writing tests feels like writing what we expect in plain English. If you think you need to know more about Ruby, here are two short resources to get started:

serverspec s homepage contains a short and concise tutorial on how to get started. Please, read it. As a first illustration, here is a test checking a service is correctly listening on port 80:

describe port(80) do
  it   should be_listening  
end

The following test will spot servers still running with Debian Squeeze instead of Debian Wheezy:

describe command("lsb_release -d") do
  it   should return_stdout /wheezy/  
end

Conditional tests are also possible. For example, we want to check the miimon parameter of bond0, but only when the interface is present:

has_bond0 = file('/sys/class/net/bond0').directory?
# miimon should be set to something other than 0, otherwise, no checks
# are performed.
describe file("/sys/class/net/bond0/bonding/miimon"), :if => has_bond0 do
  it   should be_file  
  its(:content)   should_not eq "0\n"  
end

serverspec comes with a complete documentation of available resource types (like port and command) that can be used after the keyword describe. When a test is too complex to be expressed with simple expectations, it can be specified with arbitrary commands. In the below example, we check if memcached is configured to use almost all the available system memory:

# We want memcached to use almost all memory. With a 2GB margin.
describe "memcached" do
  it "should use almost all memory" do
    total = command("vmstat -s   head -1").stdout #  
    total = /\d+/.match(total)[0].to_i
    total /= 1024
    args = process("memcached").args #  
    memcached = /-m (\d+)/.match(args)[1].to_i
    (total - memcached).should be > 0
    (total - memcached).should be < 2000
  end
end

A bit more arcane, but still understandable: we combine arbitrary shell commands (in ) and use of other serverspec resource types (in ).

Advanced use Out of the box, serverspec provides a strong fundation to build a compliance tool to be run on all systems. It comes with some useful advanced tips, like sharing tests among similar hosts or executing several tests in parallel. I have setup a GitHub repository to be used as a template to get the following features:

assign roles to servers and tests to roles;

parallel execution;

report generation & viewer.

Host classification By default, serverspec-init generates a template where each host has its own directory with its unique set of tests. serverspec only handles test execution on remote hosts: the test execution flow (which tests are executed on which servers) is delegated to some Rakefile¹. Instead of extracting the list of hosts to test from a directory hiearchy, we can extract it from a file (or from an LDAP server or from any source) and attach a set of roles to each of them:

hosts = File.foreach("hosts")
  .map    line  line.strip  
  .map do  host 
   
    :name => host.strip,
    :roles => roles(host.strip),
   
end

The roles() function should return a list of roles for a given hostname. It could be something as simple as this:

def roles(host)
  roles = [ "all" ]
  case host
  when /^web-/
    roles << "web"
  when /^memc-/
    roles << "memcache"
  when /^lb-/
    roles << "lb"
  when /^proxy-/
    roles << "proxy"
  end
  roles
end

In the snippet below, we create a task for each server as well as a server:all task that will execute the tests for all hosts (in ). Pay attention, in , at how we attach the roles to each server.

namespace :server do
  desc "Run serverspec to all hosts"
  task :all => hosts.map    h  h[:name]   #  
  hosts.each do  host 
    desc "Run serverspec to host # host[:name] "
    ServerspecTask.new(host[:name].to_sym) do  t 
      t.target = host[:name]
      #  : Build the list of tests to execute from server roles
      t.pattern = './spec/ ' + host[:roles].join(",") + ' /*_spec.rb'
    end
  end
end

You can check the list of tasks created:

$ rake -T
rake check:server:all      # Run serverspec to all hosts
rake check:server:web-10   # Run serverspec to host web-10
rake check:server:web-11   # Run serverspec to host web-11
rake check:server:web-12   # Run serverspec to host web-12

Then, you need to modify spec/spec_helper.rb to tell serverspec to fetch the host to test from the environment variable TARGET_HOST instead of extracting it from the spec file name.

Parallel execution By default, each task is executed when the previous one has finished. With many hosts, this can take some time. rake provides the -j flag to specify the number of tasks to be executed in parallel and the -m flag to apply parallelism to all tasks:

$ rake -j 10 -m check:server:all

Reports rspec is invoked for each host. Therefore, the output is something like this:

$ rake spec
env TARGET_HOST=web-10 /usr/bin/ruby -S rspec spec/web/apache2_spec.rb spec/all/debian_spec.rb
......
Finished in 0.99715 seconds
6 examples, 0 failures
env TARGET_HOST=web-11 /usr/bin/ruby -S rspec spec/web/apache2_spec.rb spec/all/debian_spec.rb
......
Finished in 1.45411 seconds
6 examples, 0 failures

This does not scale well if you have dozens or hundreds of hosts to test. Moreover, the output is mangled with parallel execution. Fortunately, rspec comes with the ability to save results in JSON format. Those per-host results can then be consolidated into a single JSON file. All this can be done in the Rakefile:

For each task, set rspec_opts to --format json --out ./reports/current/# target .json. This is done automatically by the subclass ServerspecTask which also handles passing the hostname in an environment variable and a more concise and colored output.
Add a task to collect the generated JSON files into a single report. The test source code is also embedded in the report to make it self-sufficient. Moreover, this task is executed automatically by adding it as a dependency of the last serverspec-related task.

Have a look at the complete Rakefile for more details on how this is done. A very simple web-based viewer can handle those reports². It shows the test results as a matrix with failed tests in red: Report viewer example

Clicking on any test will display the necessary information to troubleshoot errors, including the test short description, the complete test code, the expectation message and the backtrace: Report viewer showing detailed error

I hope this additional layer will help making serverspec another feather in the IT cap, between an automation tool and a supervision tool.

A Rakefile is a Makefile where tasks and their dependencies are described in plain Ruby. rake will execute them in the appropriate order.
The viewer is available in the GitHub repository in the viewer/ directory.

11 November 2013

Vincent Bernat: Snimpy: SNMP & Python

While quite old fashioned, SNMP is still an ubiquitous protocol supported by most network equipments. It comes handy to expose various metrics, like network interface counters, to be gathered for the purpose of monitoring. It can also be used to retrieve and modify equipments configuration. Variables exposed by SNMP agents (servers) are organized inside a Management Information Base (MIB) which is a hierarchical database¹. Each entry is identified by an OID. Querying a specific OID allows a manager (client) to get the value of an associated variable. For example, one common MIB-module is IF-MIB defined in RFC 2863. It contains objects used to manage network interfaces. One of them is ifTable whose rows are representing agent s logical network interfaces. Each row will expose the interface name, characteristics and various associated counters.

ifIndex	ifDescr	ifPhysAddress	ifOperStatus	ifOutOctets
1	lo		up	545721741
2	eth0	0:18:f3:3:4e:4	up	78875421
3	eth1	0:18:f3:3:4e:5	down	0

ifTable is indexed by its first column ifIndex. If you want to get the operational status of the second interface, you need to query IF-MIB::ifOperStatus.2 which is translated to OID .1.3.6.1.2.1.2.2.1.8.2 using information provided by the MIB definition.

Scripting SNMP An SNMP agent can deliver a lot of interesting information:

interface counters with IF-MIB,
IP addresses with IP-MIB,
routing tables with IP-FORWARD-MIB,
inventory with ENTITY-MIB,
neighbors with LLDP-MIB.

You can gather those information manually with tools like snmpget and snmpwalk:

$ snmpwalk -v 2c -c public localhost IF-MIB::ifDescr      
IF-MIB::ifDescr.1 = STRING: lo
IF-MIB::ifDescr.2 = STRING: eth0
IF-MIB::ifDescr.3 = STRING: eth1

However, building robust scripts with them is quite challenging. For example, if you wanted to get the descriptions of all active interfaces as well as the total number of octets transmitted, you could do something like that:

#!/bin/sh
set -e
host="$ 1:-localhost "
community="$ 2:-public "
args="-v2c -c $community $host"
for idx in $(snmpwalk -Ov -OQ $args IF-MIB::ifIndex); do
    descr=$(snmpget -Ov -OQ $args IF-MIB::ifDescr.$idx)
    oper=$(snmpget -Ov -OQ $args IF-MIB::ifOperStatus.$idx)
    in=$(snmpget -Ov -OQ $args IF-MIB::ifInOctets.$idx)
    out=$(snmpget -Ov -OQ $args IF-MIB::ifOutOctets.$idx)
    [ x"$descr" != x"lo" ]   continue
    [ x"$oper" = x"up" ]   continue
    echo $descr $in $out
done

Hopefully, SNMP bindings in various languages are pretty common. For example, Net-SNMP ships with a Python binding:

import argparse
import netsnmp
parser = argparse.ArgumentParser()
parser.add_argument("host", default="localhost", nargs="?",
                    help="Agent to retrieve variables from")
parser.add_argument("community", default="public", nargs="?",
                    help="Community to query the agent")
options = parser.parse_args()
args =  
    "Version": 2,
    "DestHost": options.host,
    "Community": options.community
 
for idx in netsnmp.snmpwalk(netsnmp.Varbind("IF-MIB::ifIndex"),
                            **args):
    descr, oper, cin, cout = netsnmp.snmpget(
        netsnmp.Varbind("IF-MIB::ifDescr", idx),
        netsnmp.Varbind("IF-MIB::ifOperStatus", idx),
        netsnmp.Varbind("IF-MIB::ifInOctets", idx),
        netsnmp.Varbind("IF-MIB::ifOutOctets", idx),
        **args)
    assert(descr is not None and
           cin is not None and
           cout is not None) #  
    if descr == "lo":
        continue
    if oper != "1": #  
        continue
    print("     ".format(descr, cin, cout))

This binding is quite primitive and has several drawbacks:

It exports everything as strings. See .
Error handling is just deficient. If you mispell something, like a variable name, you ll get snmp_build: unknown failure on the standard error. No exception. If a variable does not exist, you ll get None instead. See .

This inability to sanely handle failures makes this binding quite dangerous to use in scripts. Imagine making important modifications on the basis of the returned values. If you forget to check against None, your script may cause havoc!

Snimpy Because I didn t find any reliable Python binding for SNMP, I decided to write Snimpy with two goals in mind:

Leverage information contained in MIBs to provide a pythonic interface.
Any error condition should raise an exception.

Here is how the previous script could be written:

#!/usr/bin/env snimpy
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("host", default="localhost", nargs="?",
                    help="Agent to retrieve variables from")
parser.add_argument("community", default="public", nargs="?",
                    help="Community to query the agent")
options = parser.parse_args()
m = M(options.host, options.community, 2)
load("IF-MIB")
for idx in m.ifDescr:
    if m.ifDescr[idx] == "lo":
        continue
    if m.ifOperStatus[idx] != "up":
        continue
    print("     ".format(m.ifDescr[idx],
                            m.ifInOctets[idx],
                            m.ifOutOctets[idx]))

You can also use a list comprehension:

load("IF-MIB")
print("\n".join([ "     ".format(m.ifDescr[idx],
                                    m.ifInOctets[idx],
                                    m.ifOutOctets[idx])
                  for idx in m.ifDescr
                  if m.ifDescr[idx] != "lo"
                  and m.ifOperStatus[idx] == "up" ]))

Here is another simple example to get the routing database from the agent:

load("IP-FORWARD-MIB")
m=M("localhost", "public", 2)
routes = m.ipCidrRouteNextHop
for x in routes:
    net, netmask, tos, src = x
    print(" :>15s / :<15s  via  :<15s  src  :<15s ".format(
        net, netmask, routes[x], src))

IP-FORWARD-MIB::ipCidrRouteNextHop is a more complex table with a compound index. Despite this, querying the table still seems natural. Have a look at Snimpy s documentation for more information. Behind the hood, SNMP requests are handled by PySNMP and MIB parsing is done with libsmi². Snimpy supports both Python 2, Python 3 and Pypy.

A MIB is defined using a subset of ASN.1 called SMI. However, it is not uncommon to refer to the definition as a MIB too.
Unfortunately, there is currently no robust SMI parser written in pure Python. For example, PySNMP relies on smidump which comes with libsmi. Snimpy uses a custom CFFI wrapper around libsmi.

6 September 2013

Vincent Bernat: High availability with ExaBGP

When it comes to provide redundant services, several options are available:

The service can be hosted behind a set of load-balancers. They will detect any faulty node. However, you need to ensure that this new layer is also fault-tolerant.
The nodes providing the service can rely on IP failover to share a set of IP using protocols like VRRP¹ or CARP. The IP address of a faulty node will be assigned to another node. This requires all nodes to be part of the same IP subnet.
The clients of the service can ask a third-party for available nodes. Usually, this is achieved through round-robin DNS where only working nodes are advertised in a DNS record. The failover can be quite long because of caches.

A common setup is a combination of those solutions: web servers are behind a couple of load-balancers achieving both redundancy and load-balancing. The load-balancers use VRRP to ensure redundancy. The whole setup is replicated to another datacenter and round-robin DNS is used to ensure both redundancy and load-balacing of the datacenters. There is a fourth option which is similar to VRRP but relies on dynamic routing and therefore is not limited to nodes in the same subnet:

The nodes advertise their availability with BGP to announce the set of service IP addresses they are able to serve. Each address is weighted such that IP addresses are balanced among the available nodes.

We will explore how to implement this fourth option using ExaBGP, the BGP swiss army knife of networking, in a small lab based on KVM. You can grab the complete lab from GitHub. ExaBGP 3.2.5 is needed to run this lab.

Environment We will be working in the following (IPv6-only) environment:

BGP configuration BGP is enabled on ER2 and ER3 to exchange routes with peers and transits (only R1 for this lab). BIRD is used as a BGP daemon. Its configuration is pretty basic. Here is a fragment:

router id 1.1.1.2;
protocol static NETS   #  
  import all;
  export none;
  route 2001:db8::/40 reject;
 
protocol bgp R1   #  
  import all;
  export where proto = "NETS";
  local as 64496;
  neighbor 2001:db8:1000::1 as 64511;
 
protocol bgp ER3   #  
  import all;
  export all;
  next hop self;
  local as 64496;
  neighbor 2001:db8:1::3 as 64496;

First, in , we declare the routes that we want to export to our peers. We don t try to summarize them from the IGP. We just unconditonnaly export the networks that we own. Then, in , R1 is defined as a neighbor and we export the static route we declared previously. We import any route that R1 is willing to send us. In , we share everything we know with our pal, ER3, using internal BGP.

OSPF configuration OSPF will distribute routes inside the AS. It is enabled on ER2, ER3, DR6, DR7 and DR8. For example, here is the relevant part of the configuration of DR6:

router id 1.1.1.6;
protocol kernel  
   persist;
   import none;
   export all;
 
protocol ospf INTERNAL  
  import all;
  export none;
  area 0.0.0.0  
    networks  
      2001:db8:1::/64;
      2001:db8:6::/64;
     ;
    interface "eth0";
    interface "eth1"   stub yes;  ;
   ;

ER2 and ER3 inject a default route into OSPF:

protocol static DEFAULT  
  import all;
  export none;
  route ::/0 via 2001:db8:1000::1;
 
filter default_route  
  if proto = "DEFAULT" then accept;
  reject;
 
protocol ospf INTERNAL  
  import all;
  export filter default_route;
  area 0.0.0.0  
    networks  
      2001:db8:1::/64;
     ;
    interface "eth1";
   ;

Web nodes The web nodes are pretty basic. They have a default static route to the nearest router and that s all. The interesting thing here is that they are each on a separate IP subnet: we cannot share an IP using VRRP². Why are those web servers on different subnets? Maybe they are not in the same datacenter or maybe your network architecture is using a routed access layer. Let s see how to use BGP to enable redundancy of those web nodes.

Redundancy with ExaBGP ExaBGP is a convenient tool to plug scripts into BGP. They can then receive and advertise routes. ExaBGP does the hard work of speaking BGP with your routers. The scripts just have to read routes from standard input or advertise them on standard output.

The big picture Here is the big picture: Let s explain it step by step:

Three IP addresses will be allocated for our web service: `2001:db8:30::1`, `2001:db8:30::2` and `2001:db8:30::3`. Those are distinct from the real IP addresses of W1, W2 and W3.

Each web node will advertise all of them to the route servers we added in the network. I will talk more about those route servers later.

Each route comes with a metric to help the route server choose where it should be routed. We choose the metrics such that each IP address will be routed to a distinct web node (unless there is a problem).

The route servers (which are not routers) will then advertise the best routes they learned to all the routers in our network. This is still done using BGP.

Now, for a given IP address, each router knows to which web node the traffic should be directed.

Here are the respective metrics announced routes for W1, W2 and W3 when everything works as expected:

Route W1 W2 W3 Best Backup

2001:db8:30::1 102 101 100 W3 W2

2001:db8:30::2 101 100 102 W2 W1

2001:db8:30::3 100 102 101 W1 W3

Route	W1	W2	W3	Best	Backup
2001:db8:30::1	102	101	100	W3	W2
2001:db8:30::2	101	100	102	W2	W1
2001:db8:30::3	100	102	101	W1	W3

ExaBGP configuration The configuration of ExaBGP is quite simple:

group rs  
  neighbor 2001:db8:1::4  
    router-id 1.1.1.11;
    local-address 2001:db8:6::11;
    local-as 65001;
    peer-as 65002;
   
  neighbor 2001:db8:8::5  
    router-id 1.1.1.11;
    local-address 2001:db8:6::11;
    local-as 65001;
    peer-as 65002;
   
  process watch-nginx  
      run /usr/bin/python /lab/healthcheck.py -s --config /lab/healthcheck-nginx.conf --start-ip 0;

A script is entailed to check if the service (an nginx web server) is up and running and advertise the appropriate IP addresses to the two declared route servers. If we run the script manually, we can see the advertised routes:

$ python /lab/healthcheck.py --config /lab/healthcheck-nginx.conf --start-ip 0
INFO[healthcheck] send announces for UP state to ExaBGP
announce route 2001:db8:30::3/128 next-hop self med 100
announce route 2001:db8:30::2/128 next-hop self med 101
announce route 2001:db8:30::1/128 next-hop self med 102
[...]
WARNING[healthcheck] Check command was unsuccessful: 7
INFO[healthcheck] Output of check command:  curl: (7) Failed connect to ip6-localhost:80; Connection refused
WARNING[healthcheck] Check command was unsuccessful: 7
INFO[healthcheck] Output of check command:  curl: (7) Failed connect to ip6-localhost:80; Connection refused
WARNING[healthcheck] Check command was unsuccessful: 7
INFO[healthcheck] Output of check command:  curl: (7) Failed connect to ip6-localhost:80; Connection refused
INFO[healthcheck] send announces for DOWN state to ExaBGP
announce route 2001:db8:30::3/128 next-hop self med 1000
announce route 2001:db8:30::2/128 next-hop self med 1001
announce route 2001:db8:30::1/128 next-hop self med 1002

When the service becomes unresponsive, the healthcheck script detect the situation and retry several times before acknowledging that the service is dead. Then, the IP addresses are advertised with higher metrics and the service will be routed to another node (the one advertising 2001:db8:30::3/128 with metric 101). This healthcheck script is now part of ExaBGP.

Route servers We could have connected our ExaBGP servers directly to each router. However, if you have 20 routers and 10 web servers, you now have to manage a mesh of 200 sessions. The route servers are here for three purposes:

Reduce the number of BGP sessions (from 200 to 60) between equipments (less configuration, less errors).

Avoid modifying the configuration on routers each time a new service is added.

Separate the routing decision (the route servers) from the routing process (the routers).

You may also ask yourself: why not use OSPF? . Good question!

OSPF could be enabled on each web node and the IP addresses advertised using this protocol. However, OSPF has several drawbacks: it does not scale, there are restrictions on the allowed topologies, it is difficult to filter routes inside OSPF and a misconfiguration will likely impact the whole network. Therefore, it is considered a good practice to limit OSPF to network equipments.

The routes learned by the route servers could be injected into OSPF. On paper, OSPF has a next-hop field to provide an explicit next-hop. This would be handy as we wouldn t have to configure adjacencies with each router. However, I have absolutely no idea how to inject BGP next-hop into OSPF next-hop. What happens is that BGP next-hop is resolved locally using OSPF routes. For example, if we inject BGP routes into OSPF from RS4, RS4 will know the appropriate next-hop but other routers will route the traffic to RS4.

Let s look at how to configure our route servers. RS4 will use BIRD while RS5 will use Quagga. Using two different implementations will help with resiliency by avoiding a bug to hit the two route servers at the same time.

Bird configuration There are two sides for BGP configuration: the BGP sessions with the ExaBGP nodes and the ones with the routers. Here is the configuration for the later:

template bgp INFRABGP  
  export all;
  import none;
  local as 65002;
  rs client;
 
protocol bgp ER2 from INFRABGP  
  neighbor 2001:db8:1::2 as 65003;
 
protocol bgp ER3 from INFRABGP  
  neighbor 2001:db8:1::3 as 65003;
 
protocol bgp DR6 from INFRABGP  
  neighbor 2001:db8:1::6 as 65003;
 
protocol bgp DR7 from INFRABGP  
  neighbor 2001:db8:1::7 as 65003;
 
protocol bgp DR8 from INFRABGP  
  neighbor 2001:db8:1::8 as 65003;

The AS number used by our route server is 65002 while the AS number used for routers is 65003 (the AS numbers for web nodes will be 65001). They are reserved for private use by RFC 6996. All routes known by the route server are exported to the routers but no routes are accepted from them. Let s have a look at the other side:

# Only import loopback IPs
filter only_loopbacks   #  
  if net ~ [ 2001:db8:30::/64 128,128  ] then accept;
  reject;
 
# General template for an EXABGP node
template bgp EXABGP  
  local as 65002;
  import filter only_loopbacks; #  
  export none;
  route limit 10; #  
  rs client;
  hold time 6; #  
  multihop 10;
  igp table internal;
 
protocol bgp W1 from EXABGP  
  neighbor 2001:db8:6::11 as 65001;
 
protocol bgp W2 from EXABGP  
  neighbor 2001:db8:7::12 as 65001;
 
protocol bgp W3 from EXABGP  
  neighbor 2001:db8:8::13 as 65001;

To ensure separation of concerns, we are being a bit more picky. With and , we only accept loopback addresses and only if they are contained in the subnet that we reserved for this use. No server should be able to inject arbitrary addresses into our network. With , we also limit the number of routes that a server can advertise. With , we reduce the hold time from 240 to 6. It means that after 6 seconds, the peer is considered dead. This is quite important to be able to recover quickly from a dead machine. The minimal value is 3. We could have use a similar setting with the session with the routers.

Quagga configuration Quagga s configuration is a bit more verbose but should be strictly equivalent:

router bgp 65002 view EXABGP
 bgp router-id 1.1.1.5
 bgp log-neighbor-changes
 no bgp default ipv4-unicast
 neighbor R peer-group
 neighbor R remote-as 65003
 neighbor R ebgp-multihop 10
 neighbor EXABGP peer-group
 neighbor EXABGP remote-as 65001
 neighbor EXABGP ebgp-multihop 10
 neighbor EXABGP timers 2 6
!
 address-family ipv6
 neighbor R activate
 neighbor R soft-reconfiguration inbound
 neighbor R route-server-client
 neighbor R route-map R-IMPORT import
 neighbor R route-map R-EXPORT export
 neighbor 2001:db8:1::2 peer-group R
 neighbor 2001:db8:1::3 peer-group R
 neighbor 2001:db8:1::6 peer-group R
 neighbor 2001:db8:1::7 peer-group R
 neighbor 2001:db8:1::8 peer-group R
 neighbor EXABGP activate
 neighbor EXABGP soft-reconfiguration inbound
 neighbor EXABGP maximum-prefix 10
 neighbor EXABGP route-server-client
 neighbor EXABGP route-map RSCLIENT-IMPORT import
 neighbor EXABGP route-map RSCLIENT-EXPORT export
 neighbor 2001:db8:6::11 peer-group EXABGP
 neighbor 2001:db8:7::12 peer-group EXABGP
 neighbor 2001:db8:8::13 peer-group EXABGP
 exit-address-family
!
ipv6 prefix-list LOOPBACKS seq 5 permit 2001:db8:30::/64 ge 128 le 128
ipv6 prefix-list LOOPBACKS seq 10 deny any
!
route-map RSCLIENT-IMPORT deny 10
!
route-map RSCLIENT-EXPORT permit 10
  match ipv6 address prefix-list LOOPBACKS
!
route-map R-IMPORT permit 10
!
route-map R-EXPORT deny 10
!

The view is here to not install routes in the kernel³.

Routers Configuring BIRD to receive routes from route servers is straightforward:

# BGP with route servers
protocol bgp RS4  
  import all;
  export none;
  local as 65003;
  neighbor 2001:db8:1::4 as 65002;
  gateway recursive;
 
protocol bgp RS5  
  import all;
  export none;
  local as 65003;
  neighbor 2001:db8:8::5 as 65002;
  multihop 4;
  gateway recursive;

It is important to set gateway recursive because most of the time, the next-hop is not reachable directly. In this case, by default, BIRD will use the IP address of the advertising router (the route servers).

Testing Let s check that everything works as expected. Here is the view from RS5:

# show ipv6  bgp  
BGP table version is 0, local router ID is 1.1.1.5
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, R Removed
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network          Next Hop            Metric LocPrf Weight Path
*  2001:db8:30::1/128
                    2001:db8:6::11         102             0 65001 i
*>                  2001:db8:8::13         100             0 65001 i
*                   2001:db8:7::12         101             0 65001 i
*  2001:db8:30::2/128
                    2001:db8:6::11         101             0 65001 i
*                   2001:db8:8::13         102             0 65001 i
*>                  2001:db8:7::12         100             0 65001 i
*> 2001:db8:30::3/128
                    2001:db8:6::11         100             0 65001 i
*                   2001:db8:8::13         101             0 65001 i
*                   2001:db8:7::12         102             0 65001 i
Total number of prefixes 3

For example, traffic to 2001:db8:30::2 should be routed through 2001:db8:7::12 (which is W2). Other IP are affected to W1 and W3. RS4 should see the same thing⁴:

$ birdc6 show route 
BIRD 1.3.11 ready.
2001:db8:30::1/128 [W3 22:07 from 2001:db8:8::13] * (100/20) [AS65001i]
                   [W1 23:34 from 2001:db8:6::11] (100/20) [AS65001i]
                   [W2 22:07 from 2001:db8:7::12] (100/20) [AS65001i]
2001:db8:30::2/128 [W2 22:07 from 2001:db8:7::12] * (100/20) [AS65001i]
                   [W1 23:34 from 2001:db8:6::11] (100/20) [AS65001i]
                   [W3 22:07 from 2001:db8:8::13] (100/20) [AS65001i]
2001:db8:30::3/128 [W1 23:34 from 2001:db8:6::11] * (100/20) [AS65001i]
                   [W3 22:07 from 2001:db8:8::13] (100/20) [AS65001i]
                   [W2 22:07 from 2001:db8:7::12] (100/20) [AS65001i]

Let s have a look at DR6:

$ birdc6 show route
2001:db8:30::1/128 via fe80::5054:56ff:fe6e:98a6 on eth0 * (100/20) [AS65001i]
                   via fe80::5054:56ff:fe6e:98a6 on eth0 (100/20) [AS65001i]
2001:db8:30::2/128 via fe80::5054:60ff:fe02:3681 on eth0 * (100/20) [AS65001i]
                   via fe80::5054:60ff:fe02:3681 on eth0 (100/20) [AS65001i]
2001:db8:30::3/128 via 2001:db8:6::11 on eth1 * (100/10) [AS65001i]
                   via 2001:db8:6::11 on eth1 (100/10) [AS65001i]

So, 2001:db8:30::3 is owned by W1 which is behind DR6 while the two others are in another part of the network and will be reached through the appropriate link-local addresses learned by OSPF. Let s kill W1 by stopping nginx. A few seconds later, DR6 learns the new routes:

$ birdc6 show route
2001:db8:30::1/128 via fe80::5054:56ff:fe6e:98a6 on eth0 * (100/20) [AS65001i]
                   via fe80::5054:56ff:fe6e:98a6 on eth0 (100/20) [AS65001i]
2001:db8:30::2/128 via fe80::5054:60ff:fe02:3681 on eth0 * (100/20) [AS65001i]
                   via fe80::5054:60ff:fe02:3681 on eth0 (100/20) [AS65001i]
2001:db8:30::3/128 via fe80::5054:56ff:fe6e:98a6 on eth0 * (100/20) [AS65001i]
                   via fe80::5054:56ff:fe6e:98a6 on eth0 (100/20) [AS65001i]

Demo For a demo, have a look at the following video (it is also available as an Ogg Theora video).

The primary target of VRRP is to make a default gateway for an IP network highly available by exposing a virtual router which is an abstract representation of multiple routers. The virtual IP address is held by the master router and any backup router can be promoted to master in case of a failure. In practice, VRRP can be used to achieve high availability of services too.

However, you could deploy an L2 overlay network using, for example, VXLAN and use VRRP on it.

I have been told on the Quagga mailing-list that such a setup is quite uncommon and that it would be better to not use views but use `--no_kernel` flag for `bgpd` instead. You may want to look at the whole thread for more details.

The output of `birdc6` shows both the next-hop as advertised in BGP and the resolved next-hop if the route have to be exported into another protocol. I have simplified the output to avoid confusion.

25 August 2013

Vincent Bernat: Boilerplate for autotools-based C project

When starting a new HTML project, a common base is to use HTML5 Boilerplate which helps by setting up the essential bits. Such a template is quite useful for both beginners and experienced developers as it is kept up-to-date with best practices and it avoids forgetting some of them. Recently, I have started several little projects written in C for a customer. Each project was bootstrapped from the previous one. I thought it would be useful to start a template that I could reuse easily. Hence, bootstrap.c¹, a template for simple projects written in C with the autotools, was born.

Usage A new project can be created from this template in three steps:

Run Cookiecutter, a command-line tool to create projects from project templates, and answer the questions.

Setup Git.

Complete the todo list .

Cookiecutter Cookiecutter is a new tool to create projects from project templates. It uses Jinja2 as a template engine for file names and contents. It is language agnostic: you can use it for Python, HTML, Javascript or C! Cookiecutter is quite simple. You can read an introduction from Daniel Greenfeld. The Debian package is currently waiting in the NEW queue and should be available in a few weeks in Debian Sid. You can also install it with pip. Bootstrapping a new project is super easy:

$ cookiecutter https://github.com/vincentbernat/bootstrap.c.git
Cloning into 'bootstrap.c'...
remote: Counting objects: 90, done.
remote: Compressing objects: 100% (68/68), done.
remote: Total 90 (delta 48), reused 64 (delta 22)
Unpacking objects: 100% (90/90), done.
Checking connectivity... done
full_name (default is "Vincent Bernat")? Alfred Thirsty
email (default is "bernat@luffy.cx")? alfred@thirsty.eu
repo_name (default is "bootstrap")? secretproject
project_name (default is "bootstrap")? secretproject
project_description (default is "boilerplate for small C programs with autotools")? Super secret project for humans

Cookiecutter asks a few questions to instantiate the template correctly. The result has been stored in the supersecret directory:

.
  autogen.sh
  configure.ac
  get-version
  m4
      ax_cflags_gcc_option.m4
      ax_ld_check_flag.m4
  Makefile.am
  README.md
  src
      log.c
      log.h
      Makefile.am
  secretproject.8
  secretproject.c
  secretproject.h
2 directories, 13 files

Remaining steps There are still some steps to be executed manually. You first need to initalize Git, as some features of this template rely on it:

$ git init
Initialized empty Git repository in /home/bernat/tmp/secretproject/.git/
$ git add .
$ git commit -m "Initial import"
[...]

Then, you need to extract the todo list built from the comments contained in source files:

$ git ls-tree -r --name-only HEAD   \
>   xargs grep -nH "T[O]DO:"   \
>   sed 's/\([^:]*:[^:]*\):\(.*\)T[O]DO:\(.*\)/\3 (\1)/'   \
>   sort -ns   \
>   awk '(last != $1)  print ""   last=$1 ; print '
2003 Add the dependencies of your project here. (configure.ac:52)
2003 The use of "Jansson" here is an example, you don't have (configure.ac:53)
2003 to keep it. (configure.ac:54)
2004 Each time you have used  PKG_CHECK_MODULES  macro (src/Makefile.am:12)
2004 in  configure.ac , you get two variables that (src/Makefile.am:13)
2004 you can substitute like above. (src/Makefile.am:14)
3000 It's time for you program to do something. Add anything (src/secretproject.c:76)
3000 you want here. */ (src/secretproject.c:77)
[...]

Only a few minutes are needed to complete those steps.

What do you get? Here are the main features:

Minimal `configure.ac` and `Makefile.am`.

Changelog based on Git logs and automatic version from Git tags².

Manual page skeleton.

Logging infrastructure with variadic functions like `log_warn()`, `log_info()`.

About the use of the autotools The autotools are a suite of tools to provide a build system for a project, including:

`autoconf` to generate a configure script, and

`automake` to generate makefiles using a similar but higher-level language.

Understanding the autotools can be a quite difficult task. There are a lot of bad documentations on the web and the manual does not help by describing corner-cases that would be useful if you wanted your project to compile for HP-UX. So, why do I use it?

I have invested a lot of time in the understanding of this build system. Once you grasp how it should be used, it works reasonably well and can cover most of your needs. Maybe CMake would be a better choice but I have yet to learn it. Moreover, the autotools are so widespread that you have to know how they work.

There are a lot of macros available for `autoconf`. Many of them are included in the GNU Autoconf Archive and ready to use. The quality of such macros are usually quite good. If you need to correctly detect the appropriate way to compile a program with GNU Readline or something compatible, there is a macro for that.

If you want to learn more about the autotools, do not read the manual. Instead, have a look at Autotools Mythbuster. Start with a minimal `configure.ac` and do not add useless macros: a macro should be used only if it solves a real problem. Happy hacking!

Retrospectively, I think `boilerplate.c` would have been a better name.

For more information on those features, have a look at their presentation in a previous post about lldpd.

18 July 2013

Vincent Bernat: Packaging a daemon for OS X

There are three main ways to distribute a command-line daemon for OS X:

Distributing source code and instructions on how to compile it.
Using a third-party package manager, like Homebrew.
Providing an installer package.

Homebrew Homebrew is a popular package management system. It works like the BSDs ports collections by downloading, compiling and installing the requested software, while also installing any required dependencies automatically. Creating a new package is quite easy and there are a lot of examples available. However, there are some limitations:

You don t really build a package but execute a recipe to locally install the software.

You need to install development tools, either a whole Xcode installation¹ or the command line version.

If you need to execute some steps as root, you will have to explain them to the user for her to execute by hand. This includes the creation of a system user or the installation of a daemon through `launchd`.

If you can t live with those limitations, you may need to build an installer package.

Building an installer package OS X comes with a graphical and a command-line installer. The graphical one is run by opening a package from the Finder. Then, the user experiences some familiar wizard allowing her to install the software in a matter of a few seconds. The documentation around how to build such a package is somewhat clumsy. You may find outdated information or pieces of information that only apply to projects using Xcode. I will try to provide here accurate bits in the following context:

You are packaging a pure command-line tool.

You are using Autoconf and Automake as a build system.

You want to support several architectures.

You want to support older versions of OS X.

Creating a package Building such a package was previously done with some graphical tool called PackageMaker. This tool is not available any more and developers are asked to switch to pkgbuild and productbuild. There is a very neat Stackoverflow article on how those tools work. A package is built in two steps:

Build component packages.
Combine them into a product archive.

A component package contains a set of files and a set of scripts to execute at various steps of the installation. You can have several component packages, for example a package for the daemon and a package for the client. They are built with pkgbuild. A product archive contains the previously created component packages as well as a file describing various options of the installer (required and optional components, license, welcome text, ). To create a component package, we need to install the needed files in some directory:

$ ./configure --prefix=/usr --sysconfdir=/private/etc
$ make
$ make install DESTDIR=$PWD/osx-pkg

We will now put the content of osx-pkg into a component package with pkgbuild:

$ mkdir pkg1
$ pkgbuild --root osx-pkg \
>    --identifier org.someid.daemon \
>    --version 0.47 \
>    --ownership recommended
>    pkg1/output.pkg
pkgbuild: Inferring bundle components from contents of osx-pkg
pkgbuild: Wrote package to output.pkg

You need to be careful with the identifier. It needs to be unique and identify your software as well as the specific component. Then, you need to create an XML file describing the installer. Let s call it distribution.xml:

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<installer-gui-script minSpecVersion="1">
    <title>Some daemon</title>
    <organization>org.someid</organization>
    <domains enable_localSystem="true"/>
    <options customize="never" require-scripts="true" rootVolumeOnly="true" />
    <!-- Define documents displayed at various steps -->
    <welcome    file="welcome.html"    mime-type="text/html" />
    <license    file="license.html"    mime-type="text/html" />
    <conclusion file="conclusion.html" mime-type="text/html" />
    <!-- List all component packages -->
    <pkg-ref id="org.someid.daemon"
             version="0"
             auth="root">output.pkg</pkg-ref>
    <!-- List them again here. They can now be organized
         as a hierarchy if you want. -->
    <choices-outline>
        <line choice="org.someid.daemon"/>
    </choices-outline>
    <!-- Define each choice above -->
    <choice
        id="org.someid.daemon"
        visible="false"
        title="some daemon"
        description="The daemon"
        start_selected="true">
      <pkg-ref id="org.someid.daemon"/>
    </choice>
</installer-gui-script>

Fortunately, this file is documented on Apple Developer Library. Since we only describe one package, we pass customize="never" as an option to skip the choice part. You can however remove this attribute when you have several component packages. The attribute rootVolumeOnly="true" explains that this daemon can only be installed system-wide. It is marked as deprecated but still works. Its replacement (domains tag) displays an unusual and buggy pane which Apple doesn t use in any of its packages. You need to put the HTML² documents into a resources directory. It is also possible to choose the background by specifying a <background/> tag. The following command will generate the product archive:

$ productbuild --distribution distribution.xml \
>              --resources resources \
>              --package-path pkg1 \
>              --version 0.47 \
>              ../final.pkg
productbuild: Wrote product to ../final.pkg

Scripts The installer can execute some scripts during installation. For example, suppose we would like to register our daemon with launchd for it to be run at system start. You first need to write some org.someid.plist file and ensure it will get installed in /Library/LaunchDaemons:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>org.someid</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/sbin/mydaemon</string>
    <string>-d</string>
  </array>
  <key>RunAtLoad</key><true/>
  <key>KeepAlive</key><true/>
</dict>
</plist>

Then create a scripts/postinstall file with the following content:

#!/bin/sh
set -e
/bin/launchctl load "/Library/LaunchDaemons/org.someid.plist"

You also need a scripts/preinstall file which will allow your program to be smoothly upgraded:

#!/bin/bash
set -e
if /bin/launchctl list "org.someid" &> /dev/null; then
    /bin/launchctl unload "/Library/LaunchDaemons/org.someid.plist"
fi

Ensure that those scripts are executable and add --scripts scripts to pkgbuild invocation. If you need to add a system user, use dscl.

Dependencies There is no dependency management built into OS X installer. Therefore, you should ensure that everything is present in the package or in the base system. For example, lldpd relies on libevent, an event notification library. This library is not provided with OS X. When it is not found, the build system will use an embedded copy. However, if you have installed libevent with Homebrew, you will get binaries linked to this local libevent installation. Your package won t work on another host. The output of otool -L can detect unwanted dependencies:

$ otool -L build/usr/sbin/lldpd
build/usr/sbin/lldpd:
        /usr/lib/libresolv.9.dylib (compatibility version 1.0.0, current version 1.0.0)
        /System/Library/Frameworks/IOKit.framework/Versions/A/IOKit (compatibility version 1.0.0, current version 275.0.0)
        /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 744.19.0)
        /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation (compatibility version 300.0.0, current version 945.18.0)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 169.3.0)

Compilation for older versions of OS X The package as built above may only work on the same version of OS X that you use to compile. If you want a package that will work on OS X 10.6, you will need to download the appropriate SDK³. Then, you need to tell the compiler the target version of OS X and the SDK location. You can do this by setting CFLAGS and LDFLAGS (and CXXFLAGS if you use C++):

$ SDK=/Developer/SDKs/MacOSX10.6.sdk
$ ./configure --prefix=/usr --sysconfdir=/private/etc \
>         CFLAGS="-mmacosx-version-min=10.6 -isysroot $SDK" \
>         LDFLAGS="-mmacosx-version-min=10.6 -isysroot $SDK"

The -mmacosx-version-min flag will be used by various macros in the headers to mark some functions as available or not, depending on the version of OS X you target. It will also be used by the linker. The -isysroot will tell the location of the SDK. Headers and libraries will be looked up in this location first.

Universal binaries Mac OS X 10.6 was available for both IA-32 and x86-64 architectures. If you want to support both installations with a single package, you need to build universal binaries. The Mach object file format, used by OS X, allows several versions of the executable in the same file. The operating system will select the most appropriate one. An easy way to generate such files is to pass

-arch i386 -arch
x86_64

to the compiler:

$ ./configure --prefix=/usr --sysconfdir=/private/etc \
>     CC="gcc -arch i386 -arch x86_64" \
>     CPP="gcc -E"

However, this is a dangerous option. Suppose that your ./configure script tries to determine some architecture-dependent parameter, like the size of an integer (with AC_CHECK_SIZEOF); it will compute it for the host architecture. The binary generated for the other architecture will therefore use a wrong value and may crash at some point. The correct way to generate an universal binary is to execute two separate compilations and to build the universal binary with lipo:

$ for arch in i386 x86_64; do
>   mkdir $arch ; cd $arch
>   ../configure --prefix=/usr --sysconfdir=/private/etc \
>       CC="gcc -arch $ARCH" \
>       CPP="gcc -E"
>   make
>   make install DESTDIR=$PWD/../target-$ARCH
>   cd ..
> done
[...]
$ lipo -create -output daemon i386/usr/sbin/daemon x86_64/usr/sbin/daemon

Since lipo only works on file, I have written a Python script applying lipo recursively to several directories.

Putting everything together Now, we need to automate a bit. You could provide some nifty script. I propose an appropriate Makefile.am instead. This way, we can use the output variables, like @VERSION@, from ./configure to generate some of the files, like distribution.xml. You need to add that in your configure.ac:

AC_CONFIG_FILES([osx/Makefile osx/distribution.xml osx/im.bernat.lldpd.plist])
AC_CONFIG_FILES([osx/scripts/preinstall],  [chmod +x osx/scripts/preinstall])
AC_CONFIG_FILES([osx/scripts/postinstall], [chmod +x osx/scripts/postinstall])
AC_SUBST([CONFIGURE_ARGS], [$ac_configure_args])

Let s have a look at osx/Makefile.am. First, we define some variables:

PKG_NAME=@PACKAGE@-@VERSION@.pkg
PKG_TITLE=@PACKAGE@ @VERSION@
PKG_DIR=@PACKAGE@-@VERSION@
ARCHS=@host_cpu@

If we want to build for several architectures, we will type

make
ARCHS="x86_64 i386"

. Otherwise it defaults to the current host s architecture. We use install-data-local target to install files specific to OS X:

install-data-local:
    install -m 0755 -d $(DESTDIR)/Library/LaunchDaemons
    install -m 0644 im.bernat.@PACKAGE@.plist $(DESTDIR)/Library/LaunchDaemons
uninstall-local:
    rm -f $(DESTDIR)/Library/LaunchDaemons/im.bernat.@PACKAGE@.plist

The main target is the product archive, built with productbuild:

../$(PKG_NAME): pkg.1/$(PKG_NAME) distribution.xml resources
    $(PRODUCTBUILD) \
        --distribution distribution.xml \
        --resources resources \
        --package-path pkg.1 \
        --version @VERSION@ \
        $@

Its main dependency is the component package:

pkg.1/$(PKG_NAME): $(PKG_DIR) scripts
    [ -d pkg.1 ]   mkdir pkg.1
    $(PKGBUILD) \
        --root $(PKG_DIR) \
        --identifier im.bernat.@PACKAGE@.daemon \
        --version @VERSION@ \
        --ownership recommended \
        --scripts scripts \
        $@

Now, we need to build $(PKG_DIR):

$(PKG_DIR): stamp-$(PKG_DIR)
stamp-$(PKG_DIR): $(ARCHS:%=%/$(PKG_DIR))
    $(srcdir)/lipo $(PKG_DIR) $^
    touch $@

The $(ARCHS:%=%/$(PKG_DIR)) will expand to

x86_64/$(PKG_DIR)
i386/$(PKG_DIR)

. This is a substitution reference⁴. Before applying our lipo script, we need to be able to build the architecture-dependent package directories:

pkg_curarch = $(@:stamp-%=%)
$(ARCHS:%=%/$(PKG_DIR)): %/$(PKG_DIR): stamp-%
$(ARCHS:%=stamp-%): stamp-%: im.bernat.lldpd.plist
    [ -d $(pkg_curarch) ]   mkdir -p $(pkg_curarch)
    (cd $(pkg_curarch) && \
        $(abs_top_srcdir)/configure @CONFIGURE_ARGS@ \
            CC="@CC@ -arch $(pkg_curarch)" \
            CPP="@CPP@")
    (cd $(pkg_curarch) && \
        $(MAKE) install DESTDIR=$(abs_builddir)/$(pkg_curarch)/$(PKG_DIR))
    touch $@

This may seem a bit obscure, but this is really like what we have described previously. Note that we use @CONFIGURE_ARGS@ which is a variable we have defined in configure.ac. Here is how a user would create its own OS X package:

$ SDK=/Developer/SDKs/MacOSX10.6.sdk
$ mkdir build && cd build
$ ../configure --prefix=/usr --sysconfdir=/private/etc --with-embedded-libevent \
   CFLAGS="-mmacosx-version-min=10.6 -isysroot $SDK" \
   LDFLAGS="-mmacosx-version-min=10.6 -isysroot $SDK"
[...]
$ make -C osx pkg ARCHS="i386 x86_64"
[...]
productbuild: Wrote product to ../lldpd-0.7.5-21-g5b90c4f-dirty.pkg
The package has been built in ../lldpd-0.7.5-21-g5b90c4f-dirty.pkg.

Even if Xcode is free, you will need to register to Apple and show a valid credit card number. That s quite an interesting requirement.
It is possible to use other formats (like RTF or just plain text) for documents but HTML seems the best choice if you want some formatting without fiddling with RTF.
Unfortunately, Apple makes it difficult to find them. The easiest way to get the SDK for Mac OS X 10.6 is to download Xcode 4.3.3, mount the image and copy the SDK to /Developer/SDKs.
This is equivalent to the use of patsubst function. However, this function is specific to GNU make. The substitution reference pattern was accepted recently in POSIX and is coined as pattern macro expansion.

28 June 2013

Vincent Bernat: lan o: a task launcher powered by cgroups

A few months ago, I was looking for a piece of software to spawn long-running tasks on behalf of some daemon with the benefit of the tasks not being interrupted when this daemon is restarted. Here are the specifications:

No special privilege is needed to submit a task.
Submitted tasks are not known beforehand.
Task output is redirect to a log file.
Submitted tasks are identified by a provided name.
With only the name as a reference, a task can be checked for existence or killed.
Tasks should not be interrupted by an unrelated event, like a configuration change or a software upgrade.

The last requirement explains why tasks are not spawned directly by the daemon requesting them: its complexity or the way it needs to be operated may need frequent restarts. Even if it is possible to re-execute a daemon while keeping its children, like stateful re-exec support in Upstart, this is quite difficult: internal state should be serialized and restored. Part of this state can be contained into a third-party library. While Upstart or systemd seemed to be good candidates for this purpose, I didn t find a straightforward way to run arbitrary tasks without any privilege¹. Here comes lan o². It is a very simple task launcher. It can run any tasks, stop them and check if they are still running. It leverages cgroups in recent Linux kernels and avoids the use of any daemon.

Quick tour Before looking at how lan o works, let s have a look at how it can be used. To avoid usage conflicts, each task is run in the context of a namespace that needs to be initialized:

$ sudo lanco testns init -u $(id -un) -g $(id -gn)

This is the only command that needs to be run as root. Subsequent ones can be run as a normal user. Let s run some task:

$ lanco testns run first-task openssl speed aes
$ lanco testns check first-task && echo "Still running"
Still running
$ lanco testns ls
testns
   first-task
      28456 openssl speed aes

The output of the task is logged into a file:

$ head -3 /var/log/lanco-testns/task-first-task.log
Doing aes-128 cbc for 3s on 16 size blocks: 8678442 aes-128 cbc's in 2.85s
Doing aes-128 cbc for 3s on 64 size blocks: 2478283 aes-128 cbc's in 2.99s
Doing aes-128 cbc for 3s on 256 size blocks: 628105 aes-128 cbc's in 3.00s

If the task is too long, we can kill it:

$ lanco testns run first-task openssl speed aes
$ lanco testns stop first-task

You cannot run a task that already exists or kill a task that does not exist:

$ lanco testns run first-task openssl speed aes
2013-06-09T22:50:34 [WARN/run] task first-task is already running
$ lanco testns stop second-task
2013-06-09T22:50:45 [WARN/stop] task second-task is not running

Thanks to the use of cgroups, lan o is able to track multiple processes even when they are forking away³:

$ lanco testns run second-task sh -c \
>  "while true; do (sleep 30 &)& sleep 1; done"
$ lanco testns ls
testns
   first-task
      28456 openssl speed aes 
   second-task
      29572 sh -c while true; do (sleep 30 &)& sleep 1; done 
      29575 sleep 30 
      29593 sleep 30 
      29596 sleep 30 
      29599 sleep 30 
      29622 sleep 30 
      29644 sleep 1 
      29645 sleep 30 
  
$ lanco testns stop second-task
$ lanco testns check second-task   echo "Killed!"
Killed!

Also, there is a top-like command (lanco testns top): lanco top

Using cgroups Control groups (cgroups) is a mechanism to partition a set of tasks and their future children into hierarchical groups with a set of parameters.

Hierarchy Let s start with the hierarchical stuff first. To create a new hierarchy, you have to mount the cgroup filesystem on some empty directory. Usually, cgroups hierarchies are setup in /sys/fs/cgroups which is a tmpfs filesystem:

# mount -t tmpfs tmpfs /sys/fs/cgroup -o nosuid,nodev,noexec,relatime,mode=755

Now, we can create our first hierarchy:

# cd /sys/fs/cgroup
# mkdir my-first-hierarchy
# mount -t cgroup cgroup my-first-hierarchy -o name=my-first-hierarchy,none
# ls -1 my-first-hierarchy
cgroup.clone_children
cgroup.event_control
cgroup.procs
notify_on_release
release_agent
tasks

We ll see later the purpose of none. The most interesting file is tasks. It contains the PIDs of all processes in our group. Since there are currently no sub-group, all processes are part of it. Let s create a sub-group and attach a process to it:

# mkdir first-child
# cd first-child
# ls -1
cgroup.clone_children
cgroup.event_control
cgroup.procs
notify_on_release
tasks
# echo $$ > tasks
# cat tasks
23184
23311
# cat /proc/$$/cgroup 
9:name=my-first-hierarchy:/first-child
8:perf_event:/
7:blkio:/
6:net_cls:/
5:freezer:/
4:devices:/
3:cpuacct,cpu:/
2:cpuset:/
1:name=systemd:/user/bernat/1

We have added our shell to the new cgroup. Moreover, all its children will also be part of this group. This explains why we have two tasks: the shell and cat. The last command is quite interesting: for each hierarchy, it shows which cgroup the task belongs to. For a given hierarchy, each task is exactly in one group. The most useful features of lan o are done using just this: a namespace is a named hierarchy and each task is enclosed into a cgroup so it can be tracked properly.

Subsystems A subsystem makes use of the task grouping provided by cgroups to assign resources to a set of tasks. Each available subsystem can only be in one hierarchy, so the usual way to setup things is to assign a hierarchy for each subsystem⁴. Let s have a look at the cpuset subsystem: it allows to assign tasks to specific CPUs and memory nodes (for NUMA systems). Let s suppose we have 4 cores and we want to assign the first core to common system tasks and the three remaining ones to nginx:

# cd /sys/fs/cgroup
# mkdir cpuset
# mount -t cgroup cgroup cpuset -o cpuset
# echo 0-3 > cpuset/cpuset.cpus
# echo 0 > cpuset/cpuset.mems
# mkdir cpuset/system
# echo 0 > cpuset/system/cpuset.cpus
# echo 0 > cpuset/system/cpuset.mems
# for task in $(cat cpuset/tasks); do
>    echo $task > cpuset/system/tasks
> done
# mkdir cpuset/nginx
# echo 1-3 > cpuset/nginx/cpuset.cpus
# echo 0 > cpuset/nginx/cpuset.mems
# for task in $(pidof nginx); do
>    echo $task > cpuset/nginx/tasks
> done

The first mount is not needed if something else (like systemd) has setup the appropriate subsystem. Now, even if some system process is going crazy, it won t affect the performance of your webserver. The kernel comes with a more complete documentation if you need additional details on cgroups. There are high level tools to manipulate them, like the tools provided by libcg, but they are currently quite buggy and invasive.

Use in lan o Here are the features used by lan o:

For each namespace, a specific named hierarchy is created.

By setting the appropriate permissions on the hierarchy, an unprivileged user can create subgroups.

Each submitted task has its own sub-group for tracking purpose.

Actions can be executed when a task terminates by using the release agent mechanism⁵.

The `cpuacct` subsystem is used to track CPU usage: a group is created in this subsystem for this purpose.

Both systemd and Upstart are difficult to run without being PID 1 and without being root. They both support user sessions but this requires to use them as PID 1 as well. I discovered later than runit would have been a good fit: it does not require to be PID 1, it does not need to be run as root and the service directory can be specified by an environment variable. It could have been tweaked to meet the above requirements.

Lan o means launcher in esperanto. The unusual diacritic on the letter c is not present in many web fonts, so the rendering may be a little odd.

Linux cgroups do not provide any facility to kill a whole group. We need to iterate through the processes and kill them one by one. lan o does not try to freeze processes before killing them and therefore may be inefficient against violent fork bombs.

The `cpu` and `cpuacct` subsystems are the notable exception. There seems to be little use to keep them in separate hierarchies, so they are part of the same hierarchy.

Unfortunately, only one global release agent per hierarchy can be used. Since lan o provides with the ability to execute an arbitrary command when a task terminates, the command needs to be stored on the filesystem for the release agent of the namespace to execute it.

20 February 2013

Thorsten Glaser: GNU autotools generated files

On Planet Debian, Vincent Bernat wrote:

The drawback of this approach is that if you rebuild configure from the released tarball, you don t have the git tree and the version will be a date. Just don t do that.

Excuse me This is totally inacceptable. Regenerating files like aclocal.m4 and Makefile.in (for automake), configure (for autoconf), and the likes is one of the absolute duties of a software package. Things will break sooner or later if people do not do that. Additionally, generated files must be remakable from the distfile, so do not break this! May I suggest, constructively, an alternative? (People rightfully, I must admit complain I m just ranting too much.)
When making a release from git, write the git describe output into a file. Then, use that file instead of trying to run the git executable if .git/. is not a directory ( test -d .git/. ). Do not call git, because, in packages, it s either not installed or/and also undesired. Couldn t comment on your blog, but felt strongly enough about this I took the effort of writing a full post of my own. (But thanks for the book recommendation.)

Vincent Bernat: lldpd 0.7.1

A few weeks ago, a new version of lldpd, a 802.1AB (aka LLDP) implementation for various Unices, has been released. LLDP is an industry standard protocol designed to supplant proprietary Link-Layer protocols such as EDP or CDP. The goal of LLDP is to provide an inter-vendor compatible mechanism to deliver Link-Layer notifications to adjacent network devices. In short, LLDP allows you to know exactly on which port is a server (and reciprocally). To illustrate its use, I have made a xkcd-like strip: xkcd-like strip for the use of LLDP

If you would like more information about lldpd, please have a look at its new dedicated website. This blog post is an insight of various technical changes that have affected lldpd since its latest major release one year ago. Lots of C stuff ahead!

Version & changelog UPDATED: Guillem Jover told me how he met the same goals for libbsd :

Save the version from git into `.dist-version` and use this file if it exists. This allows one to rebuild `./configure` from the published tarball without losing the version. This also handles Thorsten Glaser s critic.

Include `CHANGELOG` in `DISTCLEANFILES` variable.

Since this is a better solution, I have adopted the appropriate line of codes from libbsd. The two following sections are partly technically outdated.

Automated version In configure.ac, I was previously using a static version number that I had to increase when releasing:

AC_INIT([lldpd], [0.5.7], [bernat@luffy.cx])

Since the information is present in the git tree, this seems a bit redundant (and easy to forget). Taking the version from the git tree is easy:

AC_INIT([lldpd],
        [m4_esyscmd_s([git describe --tags --always --match [0-9]* 2> /dev/null   date +%F])],
        [bernat@luffy.cx])

If the head of the git tree is tagged, you get the exact tag (0.7.1 for example). If it is not, you get the nearest one, the number of commits since it and part of the current hash (0.7.1-29-g2909519 for example). The drawback of this approach is that if you rebuild configure from the released tarball, you don t have the git tree and the version will be a date. Just don t do that.

Automated changelog Generating the changelog from git is a common practice. I had some difficulties to make it right. Here is my attempt (I am using automake):

dist_doc_DATA = README.md NEWS ChangeLog
.PHONY: $(distdir)/ChangeLog
dist-hook: $(distdir)/ChangeLog
$(distdir)/ChangeLog:
        $(AM_V_GEN)if test -d $(top_srcdir)/.git; then \
          prev=$$(git describe --tags --always --match [0-9]* 2> /dev/null) ; \
          for tag in $$(git tag   grep -E '^[0-9]+(\.[0-9]+) 1, $$'   sort -rn); do \
            if [ x"$$prev" = x ]; then prev=$$tag ; fi ; \
            if [ x"$$prev" = x"$$tag" ]; then continue; fi ; \
            echo "$$prev [$$(git log $$prev -1 --pretty=format:'%ai')]:" ; \
            echo "" ; \
            git log --pretty=' - [%h] %s (%an)' $$tag..$$prev ; \
            echo "" ; \
            prev=$$tag ; \
          done > $@ ; \
        else \
          touch $@ ; \
        fi
ChangeLog:
        touch $@

Changelog entries are grouped by version. Since it is a bit verbose, I still maintain a NEWS file with important changes.

Core

C99 I have recently read 21st Century C which has some good bits and also handles the ecosystem around C. I have definitively adopted designated initializers in my coding style. Being a GCC extension since a long time, this is not a major compatibility problem. Without designated initializers:

struct netlink_req req;
struct iovec iov;
struct sockaddr_nl peer;
struct msghdr rtnl_msg;
memset(&req, 0, sizeof(req));
memset(&iov, 0, sizeof(iov));
memset(&peer, 0, sizeof(peer));
memset(&rtnl_msg, 0, sizeof(rtnl_msg));
req.hdr.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtgenmsg));
req.hdr.nlmsg_type = RTM_GETLINK;
req.hdr.nlmsg_flags = NLM_F_REQUEST   NLM_F_DUMP;
req.hdr.nlmsg_seq = 1;
req.hdr.nlmsg_pid = getpid();
req.gen.rtgen_family = AF_PACKET;
iov.iov_base = &req;
iov.iov_len = req.hdr.nlmsg_len;
peer.nl_family = AF_NETLINK;
rtnl_msg.msg_iov = &iov;
rtnl_msg.msg_iovlen = 1;
rtnl_msg.msg_name = &peer;
rtnl_msg.msg_namelen = sizeof(struct sockaddr_nl);

With designated initializers:

struct netlink_req req =  
    .hdr =  
        .nlmsg_len = NLMSG_LENGTH(sizeof(struct rtgenmsg)),
        .nlmsg_type = RTM_GETLINK,
        .nlmsg_flags = NLM_F_REQUEST   NLM_F_DUMP,
        .nlmsg_seq = 1,
        .nlmsg_pid = getpid()  ,
    .gen =   .rtgen_family = AF_PACKET  
 ;
struct iovec iov =  
    .iov_base = &req,
    .iov_len = req.hdr.nlmsg_len
 ;
struct sockaddr_nl peer =   .nl_family = AF_NETLINK  ;
struct msghdr rtnl_msg =  
    .msg_iov = &iov,
    .msg_iovlen = 1,
    .msg_name = &peer,
    .msg_namelen = sizeof(struct sockaddr_nl)
 ;

Logging Logging in lldpd was not extensive. Usually, when receiving a bug report, I asked the reporter to add some additional `printf()` calls to determine where the problem was. This was clearly suboptimal. Therefore, I have added many `log_debug()` calls with the ability to filter out some of them. For example, to debug interface discovery, one can run lldpd with `lldpd -ddd -D interface`. Moreover, I have added colors when logging to a terminal. This may seem pointless but it is now far easier to spot warning messages from debug ones.

libevent In lldpd 0.5.7, I was using my own `select()`-based event loop. It worked but I didn t want to grow a full-featured event loop inside lldpd. Therefore, I switched to libevent. The minimal required version of libevent is 2.0.5. A convenient way to check the changes in API is to use Upstream Tracker, a website tracking API and ABI changes for various libraries. This version of libevent is not available in many stable distributions. For example, Debian Squeeze or Ubuntu Lucid only have 1.4.13. I am also trying to keep compatibility with very old distributions, like RHEL 2, which does not have a packaged libevent at all. For some users, it may be a burden to compile additional libraries. Therefore, I have included libevent source code in lldpd source tree (as a git submodule) and I am only using it if no suitable system libevent is available. Have a look at `m4/libevent.m4` and `src/daemon/Makefile.am` to see how this is done.

Client

Serialization lldpctl is a client querying lldpd to display discovered neighbors. The communication is done through an Unix socket. Each structure to be serialized over this socket should be described with a string. For example:

#define STRUCT_LLDPD_DOT3_MACPHY "(bbww)"
struct lldpd_dot3_macphy  
        u_int8_t                 autoneg_support;
        u_int8_t                 autoneg_enabled;
        u_int16_t                autoneg_advertised;
        u_int16_t                mau_type;
 ;

I did not want to use stuff like Protocol Buffers because I didn t want to copy the existing structures to other structures before serialization (and the other way after deserialization). However, the serializer in lldpd did not allow to handle reference to other structures, lists or circular references. I have written another one which works by annotating a structure with some macros:

struct lldpd_chassis  
    TAILQ_ENTRY(lldpd_chassis) c_entries;
    u_int16_t        c_index;
    u_int8_t         c_protocol;
    u_int8_t         c_id_subtype;
    char            *c_id;
    int              c_id_len;
    char            *c_name;
    char            *c_descr;
    u_int16_t        c_cap_available;
    u_int16_t        c_cap_enabled;
    u_int16_t        c_ttl;
    TAILQ_HEAD(, lldpd_mgmt) c_mgmt;
 ;
MARSHAL_BEGIN(lldpd_chassis)
MARSHAL_TQE  (lldpd_chassis, c_entries)
MARSHAL_FSTR (lldpd_chassis, c_id, c_id_len)
MARSHAL_STR  (lldpd_chassis, c_name)
MARSHAL_STR  (lldpd_chassis, c_descr)
MARSHAL_SUBTQ(lldpd_chassis, lldpd_mgmt, c_mgmt)
MARSHAL_END;

Only pointers need to be annotated. The remaining of the structure can be serialized with just memcpy()¹. I think there is still room for improvement. It should be possible to add annotations inside the structure and avoid some duplication. Or maybe, using a C parser? Or using the AST output from LLVM?

Library In lldpd 0.5.7, there are two possible entry points to interact with the daemon:

Through SNMP support. Only information available in LLDP-MIB are exported. Therefore, implementation-specific values are not available. Moreover, SNMP support is currently read-only.
Through lldpctl. Thanks to a contribution from Andreas Hofmeister, the output can be requested to be formatted as an XML document.

Integration of lldpd into a network stack was therefore limited to one of those two channels. As an exemple, you can have a look at how Vyatta made the integration using the second solution. To provide a more robust solution, I have added a shared library, liblldpctl, with a stable and well-defined API. lldpctl is now using it. I have followed those directions²:

Consistent naming (all exported symbols are prefixed by lldpctl_). No pollution of the global namespace.
Consistent return codes (on errors, all functions returning pointers are returning NULL, all functions returning integers are returning -1).
Reentrant and thread-safe. No global variables.
One well-documented include file.
Reduce the use of boilerplate code. Don t segfault on NULL, accept integer input as string, provide easy iterators,
Asynchronous API for input/output. The library delegates reading and writing by calling user-provided functions. Those functions can yield their effects. In this case, the user has to callback the library when data is available for reading or writing. It is therefore possible to integrate the library with any existing event-loop. A thin synchronous layer is provided on top of this API.
Opaque types with accessor functions.

Accessing bits of information is done through atoms which are opaque containers of type lldpctl_atom_t. From an atom, you can extract some properties as integers, strings, buffers or other atoms. The list of ports is an atom. A port in this list is also an atom. The list of VLAN present on this port is an atom, as well as each VLAN in this list. The VLAN name is a NULL-terminated string living in the scope of an atom. Accessing a property is done by a handful of functions, like lldpctl_atom_get_str(), using a specific key. For example, here is how to display the list of VLAN assuming you have one port as an atom:

vlans = lldpctl_atom_get(port, lldpctl_k_port_vlans);
lldpctl_atom_foreach(vlans, vlan)  
    vid = lldpctl_atom_get_int(vlan,
                               lldpctl_k_vlan_id));
    name = lldpctl_atom_get_str(vlan,
                                lldpctl_k_vlan_name));
    if (vid && name)
        printf("VLAN %d: %s\n", vid, name);
 
lldpctl_atom_dec_ref(vlans);

Internally, an atom is typed and reference counted. The size of the API is greatly limited thanks to this concept. There are currently more than one hundred pieces of information that can be retrieved from lldpd. Ultimately, the library will also enable the full configuration of lldpd. Currently, many aspects can only be configured through command-line flags. The use of the library does not replace lldpctl which will still be available and be the primary client of the library.

CLI Having a configuration file was requested since a long time. I didn t want to include a parser in `lldpd`: I am trying to keep it small. It was already possible to configure `lldpd` through `lldpctl`. Locations, network policies and power policies were the three items that could be configured this way. So, the next step was to enable `lldpctl` to read a configuration file, parse it and send the result to `lldpd`. As a bonus, why not provide a full CLI accepting the same statements with inline help and completion?

Parsing & completion Because of completion, it is difficult to use a YACC generated parser. Instead, I define a tree where each node accepts a word. A node is defined with this function:

struct cmd_node *commands_new(
    struct cmd_node *,
    const char *,
    const char *,
    int(*validate)(struct cmd_env*, void *),
    int(*execute)(struct lldpctl_conn_t*, struct writer*,
        struct cmd_env*, void *),
    void *);

A node is defined by:

its parent,
an optional accepted static token,
an help string,
an optional validation function and
an optional function to execute if the current token is accepted.

When walking the tree, we maintain an environment which is both a key-value store and a stack of positions in the tree. The validation function can check the environment to see if we are in the right context (we want to accept the keyword foo only once, for example). The execution function can add the current token as a value in the environment but it can also pop the current position in the tree to resume walk from a previous node. As an example, see how nodes for configuration of a coordinate-based location are registered:

/* Our root node */
struct cmd_node *configure_medloc_coord = commands_new(
    configure_medlocation,
    "coordinate", "MED location coordinate configuration",
    NULL, NULL, NULL);
/* The exit node.
   The validate function will check if we have both
   latitude and longitude. */
commands_new(configure_medloc_coord,
    NEWLINE, "Configure MED location coordinates",
    cmd_check_env, cmd_medlocation_coordinate,
    "latitude,longitude");
/* Store latitude. Once stored, we pop two positions
   to go back to the "root" node. The user can only
   enter latitude once. */
commands_new(
    commands_new(
        configure_medloc_coord,
        "latitude", "Specify latitude",
        cmd_check_no_env, NULL, "latitude"),
    NULL, "Latitude as xx.yyyyN or xx.yyyyS",
    NULL, cmd_store_env_value_and_pop2, "latitude");
/* Same thing for longitude */
commands_new(
    commands_new(
        configure_medloc_coord,
        "longitude", "Specify longitude",
        cmd_check_no_env, NULL, "longitude"),
    NULL, "Longitude as xx.yyyyE or xx.yyyyW",
    NULL, cmd_store_env_value_and_pop2, "longitude");

The definition of all commands is still a bit verbose but the system is simple enough yet powerful enough to cover all needed cases.

Readline When faced with a CLI, we usually expect some perks like completion, history handling and help. The most used library to provide such features is the GNU Readline Library. Because this is a GPL library, I have first searched an alternative. There are several of them:

NetBSD Editline library (`libedit`).

Autotool port of the NetBSD Editline library (also `libedit`).

Debian version of the Editline library (`libeditline`).

linenoise, a small and minimal readline library.

Many others.

From an API point of view, the first three libraries support the GNU Readline API. They also have a common native API. Moreover, this native API also handles tokenization. Therefore, I have developed the first version of the CLI with this API³. Unfortunately, I noticed later this library is not very common in the Linux world and is not available in RHEL. Since I have used the native API, it was not possible to fallback to the GNU Readline library. So, let s switch! Thanks to the appropriate macro from the Autoconf Archive (with small modifications), the compilation and linking differences between the libraries are taken care of. Because GNU Readline library does not come with a tokenizer, I had to write one myself. The API is also badly documented and it is difficult to know which symbol is available in which version. I have limited myself to:

`readline()`, `addhistory()`,

`rl_insert_text()`,

`rl_forced_update_display()`,

`rl_bind_key()`

`rl_line_buffer` and `rl_point`.

Unfortunately, the various `libedit` libraries have a noop for `rl_bind_key()`. Therefore, completion and online help is not available with them. I have noticed that most BSD come with GNU Readline library preinstalled, so it could be considered as a system library. Nonetheless, linking with `libedit` to avoid licensing issues is possible and help can be obtained by prefixing the command with `help`.

OS specific support

Netlink on Linux Previously, the list of interfaces was retrieved through `getifaddrs()`. lldpd is now using directly Netlink on Linux. This is not a big change since the GNU C Library already uses it to implement `getifaddrs()` and additional information, like VLAN, are still retrieved through `ioctl()` or sysfs. However, lldpd now gets notified when a change happens and update all interfaces in the next second. Like many other projects, I have written my own Netlink implementation instead of using libnl, a nice collection of libraries providing everything you need to query the kernel through Netlink, including some advanced bits. Why?

The latest version of libnl is still young and its availability in major distributions is scarce. It is not available in Debian Squeeze but will be available in Debian Wheezy. Like libevent, I could circumvent this problem by shipping the library with lldpd and use it when there is not system alternative. But

libnl is licensed under LGPL 2.1. This makes static linking difficult because the license is quite shaddy about static linking being derivative work or not. It is believed that it is authorized under the same provisions as in LGPL 3 which handles the case explicitely. This has been a problem with many projects. For example, OGRE has added an exception for static linking in version 1.6 and switched to MIT license in version 1.7.

I had a short discussion with Thomas Graf about this issue and he seems willing to add a similar exception. This may take some time, but once this is done, I will happily switch to libnl and retrieve more stuff from Netlink.

BSD support Until version 0.7, lldpd was Linux-only. The rewrite to use Netlink was the occasion to abstract interfaces and to port to other OS. The first port was for Debian GNU/kFreeBSD, then for FreeBSD, OpenBSD and NetBSD. They all share the same source code:

getifaddrs() to get the list of interfaces,
bpf(4) to attach to an interface to receive and send packets,
PF_ROUTE socket to be notified when a change happens.

Each BSD has its own ioctl() to retrieve VLAN, bridging and bonding bits but they are quite similar. The code was usually adapted from ifconfig.c. The BSD ports have the same functionalities than the Linux port, except for NetBSD which lacks support for LLDP-MED inventory since I didn t find a simple way to retrieve DMI related information. They also offer greater security by filtering packets sent. Moreover, OpenBSD allows to lock the filters set on the socket:

/* Install write filter (optional) */
if (ioctl(fd, BIOCSETWF, (caddr_t)&fprog) < 0)  
    rc = errno;
    log_info("privsep", "unable to setup write BPF filter for %s",
        name);
    goto end;
 
/* Lock interface */
if (ioctl(fd, BIOCLOCK, (caddr_t)&enable) < 0)  
    rc = errno;
    log_info("privsep", "unable to lock BPF interface %s",
        name);
    goto end;

This is a very nice feature. lldpd is using a privileged process to open the raw socket. The socket is then transmitted to an unprivileged process. Without this feature, the unprivileged process can remove the BPF filters. I have ported the ability to lock a socket filter program to Linux. However, I still have to add a write filter.

OS X support Once FreeBSD was supported, supporting OS X seemed easy. I got sponsored by xcloud.me which provided a virtual Mac server. Making lldpd work with OS X took only two days, including a full hour to guess how to get Apple Xcode without providing a credit card. To help people installing lldpd on OS X, I have also written a lldpd formula for Homebrew which seems to be the most popular package manager for OS X.

Upstart and systemd support Many distributions propose upstart and systemd as a replacement or an alternative for the classic SysV init. Like most daemons, lldpd detaches itself from the terminal and run in the background, by forking twice, once it is ready (for lldpd, this just means we have setup the control socket). While both upstart and systemd can accommodate daemons that behave like this, it is recommended to not fork. How to advertise readiness in this case? With upstart, lldpd will send itself the SIGSTOP signal. upstart will detect this, resume lldpd with SIGCONT and assume it is ready. The code to support upstart is therefore quite simple. Instead of calling daemon(), do this:

const char *upstartjob = getenv("UPSTART_JOB");
if (!(upstartjob && !strcmp(upstartjob, "lldpd")))
    return 0;
log_debug("main", "running with upstart, don't fork but stop");
raise(SIGSTOP);

The job configuration file looks like this:

# lldpd - LLDP daemon
description "LLDP daemon"
start on net-device-up IFACE=lo
stop on runlevel [06]
expect stop
respawn
script
  . /etc/default/lldpd
  exec lldpd $DAEMON_ARGS
end script

systemd provides a socket to achieve the same goal. An application is expected to write READY=1 to the socket when it is ready. With the provided library, this is just a matter of calling sd_notify("READY=1\n"). Since sd_notify() has less than 30 lines of code, I have rewritten it to avoid an external dependency. The appropriate unit file is:

[Unit]
Description=LLDP daemon
Documentation=man:lldpd(8)
[Service]
Type=notify
NotifyAccess=main
EnvironmentFile=-/etc/default/lldpd
ExecStart=/usr/sbin/lldpd $DAEMON_ARGS
Restart=on-failure
[Install]
WantedBy=multi-user.target

OS include files Linux-specific include files were a major pain in previous versions of lldpd. The problems range from missing header files (like `linux/if_bonding.h`) to the use of kernel-only types. Those headers have a difficult history. They were first shipped with the C library but were rarely synced and almost always outdated. They were then extracted from kernel version with almost no change and lagged behind the kernel version used by the released distribution⁴. Today, the problem is acknowledged and is being solved by both the distributions which extract the headers from the packaged kernel and by kernel developers with a separation of kernel-only headers from user-space API headers. However, we still need to handle legacy. A good case is `linux/ethtool.h`:

It can just be absent.

It can use `u8`, `u16` types which are kernel-only types. To work around this issue, type munging can be setup.

It can miss some definition, like `SPEED_10000`. In this case, you either define the missing bits and find yourself with a long copy of the original header interleaved with `#ifdef` or conditionally use each symbol. The latest solution is a burden by itself but it also hinders some functionalities that can be available in the running kernel.

The easy solution to all this mess is to just include the appropriate kernel headers into the source tree of the project. Thanks to Google ripping them for its Bionic C library, we know that copying kernel headers into a program does not create a derivative work.

Therefore, the use of `u_int16_t` and `u_int8_t` types is a left-over of the previous serializer where the size of all members was important.

For more comprehensive guidelines, be sure to check Writing a C library.

Tokenization is not the only advantage of `libedit` native API. The API is also cleaner, does not have a global state and has a better documentation. All the implementations are also BSD licensed.

For example, in Debian Sarge, the Linux kernel was a 2.6.8 (2004) while the kernel headers were extracted from some pre-2.6 kernel.

2 November 2012

Vincent Bernat: Network virtualization with VXLAN

Virtual eXtensible Local Area Network (VXLAN) is a protocol to overlay a virtualized L2 network over an existing IP network with little setup. It is currently described in an Internet-Draft. It adds the following perks to VLANs while still providing isolation:

It uses a 24-bit VXLAN Network Identifier (VNI) which should be enough to address any scale-based concerns of multitenancy.
It wraps L2 frames into UDP datagrams. This allows one to rely on some interesting properties of IP networks like availability and scalability. A VXLAN segment can be extended far beyond the typical reach of today VLANs.

The VXLAN Tunnel End Point (VTEP) originates and terminates VXLAN tunnels. Thanks to a serie of patches from Stephen Hemminger, Linux can now act as a VTEP. Let s see how this works.

About IPv6 When possible, I try to use IPv6 for my labs. This is not the case here for several reasons:

IP multicast is required and PIM-SM implementations for IPv6 are not widespread yet. However, they exist. This explains why I use XORP for this lab: it supports PIM-SM for both IPv4 and IPv6.

VXLAN Internet-Draft specifically addresses only IPv4. This seems a bit odd for a protocol running on top of UDP and I hope this will be fixed soon. This is not a major stopper since some VXLAN implementations support IPv6.

However, the current implementation for Linux does not support IPv6. IPv6 support will be added later.

Once IPv6 support is available, the lab should be easy to adapt.

Lab So, here is the lab used. R1, R2 and R3 will act as VTEPs. They do not make use of PIM-SM. Instead, they have a generic multicast route on eth0. E1, E2 and E3 are edge routers while C1, C2 and C3 are core routers. The proposed lab is not resilient but convenient to explain how things work. It is built on top of KVM hosts. Have a look at my previous article for more details on this. VXLAN lab

The lab is hosted on GitHub. I have made the lab easier to try by including the kernel I have used for my tests. XORP comes preconfigured, you just have to configure the VXLAN part. For this, you need a recent version of ip.

$ sudo apt-get install screen vde2 kvm iproute xorp git
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git
$ cd iproute2
$ ./configure && make
You get  ip' as  ip/ip' and  bridge' as  bridge/bridge'.
$ cd ..
$ git clone git://github.com/vincentbernat/network-lab.git
$ cd network-lab/lab-vxlan
$ ./setup

Unicast routing The first step is to setup unicast routing. OSPF is used for this purpose. The chosen routing daemon is XORP. With xorpsh, we can check if OSPF is working as expected:

root@c1# xorpsh
root@c1$ show ospf4 neighbor   
  Address         Interface             State      ID              Pri  Dead
192.168.11.11    eth0/eth0              Full      3.0.0.1          128    36
192.168.12.22    eth1/eth1              Full      3.0.0.2          128    33
192.168.101.133  eth2/eth2              Full      2.0.0.3          128    36
192.168.102.122  eth3/eth3              Full      2.0.0.2          128    38
root@c1$ show route table ipv4 unicast ospf   
192.168.1.0/24  [ospf(110)/2]
                > to 192.168.11.11 via eth0/eth0
192.168.2.0/24  [ospf(110)/2]
                > to 192.168.12.22 via eth1/eth1
192.168.3.0/24  [ospf(110)/3]
                > to 192.168.102.122 via eth3/eth3
192.168.13.0/24 [ospf(110)/2]
                > to 192.168.102.122 via eth3/eth3
192.168.21.0/24 [ospf(110)/2]
                > to 192.168.101.133 via eth2/eth2
192.168.22.0/24 [ospf(110)/2]
                > to 192.168.12.22 via eth1/eth1
192.168.23.0/24 [ospf(110)/2]
                > to 192.168.101.133 via eth2/eth2
192.168.103.0/24        [ospf(110)/2]
                > to 192.168.102.122 via eth3/eth3

Multicast routing Once unicast routing is up and running, we need to setup multicast routing. There are two protocols for this: IGMP and PIM-SM. The former one allows routers to forward multicast datagrams while the first one allows hosts to subscribe to a multicast group.

IGMP IGMP is used by hosts and adjacent routers to establish multicast group membership. In our case, it will be used by R2 to let E2 know it subscribed to 239.0.0.11 (a multicast group). Configuring XORP to support IGMP is simple. Let s test with iperf to have a multicast listener on R2:

root@r2# iperf -u -s -l 1000 -i 1 -B 239.0.0.11
------------------------------------------------------------
Server listening on UDP port 5001
Binding to local address 239.0.0.11
Joining multicast group  239.0.0.11
Receiving 1000 byte datagrams
UDP buffer size:  208 KByte (default)
------------------------------------------------------------

On E2, we can now check that R2 is properly registered for 239.0.0.11:

root@e2$ show igmp group
Interface    Group           Source          LastReported Timeout V State
eth0         239.0.0.11      0.0.0.0         192.168.2.2      248 2     E

XORP documentation contains a good overview of IGMP.

PIM-SM PIM-SM is far more complex. It does not have its own topology discovery protocol and relies on routing information from other protocols, OSPF in our case. I will describe here a simplified view on how PIM-SM works. XORP documentation contains more details about PIM-SM. The first step for all PIM-SM routers is to elect a rendez-vous point (RP). In our lab, only C1, C2 and C3 have been configured to be elected as a RP. Moreover, we give better priority to C3 to ensure it wins. RP election

root@e1$ show pim rps   
RP              Type      Pri Holdtime Timeout ActiveGroups GroupPrefix       
192.168.101.133 bootstrap 100      150     135            0 239.0.0.0/8

Let s suppose we start iperf on both R2 and R3. Using IGMP, they subscribe to multicast group 239.0.0.11 with E2 and E3 respectively. Then, E2 and E3 send a join message (also known as a (*,G) join) to the RP (C3) for that multicast group. Using the unicast path from E2 and E3 to the RP, the routers along the paths build the RP tree (RPT), rooted at C3. Each router in the tree knows how to send multicast packets to 239.0.0.11: it will send them to the leaves. RP tree

root@e3$ show pim join   
Group           Source          RP              Flags
239.0.0.11      0.0.0.0         192.168.101.133 WC   
    Upstream interface (RP):   eth2
    Upstream MRIB next hop (RP): 192.168.23.133
    Upstream RPF'(*,G):        192.168.23.133
    Upstream state:            Joined 
    Join timer:                5
    Local receiver include WC: O...
    Joins RP:                  ....
    Joins WC:                  ....
    Join state:                ....
    Prune state:               ....
    Prune pending state:       ....
    I am assert winner state:  ....
    I am assert loser state:   ....
    Assert winner WC:          ....
    Assert lost WC:            ....
    Assert tracking WC:        O.O.
    Could assert WC:           O...
    I am DR:                   O..O
    Immediate olist RP:        ....
    Immediate olist WC:        O...
    Inherited olist SG:        O...
    Inherited olist SG_RPT:    O...
    PIM include WC:            O...

Let s suppose that R1 wants to send multicast packets to 239.0.0.11. It sends them to R1 which does not have any information on how to contact all the members of the multicast group because it is not the RP. Therefore, it encapsulates the multicast packets into PIM Register packets and sends them to the RP. The RP decapsulates them and sends them natively. The multicast packets are routed from the RP to R2 and R3 using the reverse path formed by the join messages. Multicast routing via the RP

root@r1# iperf -c 239.0.0.11 -u -b 10k -t 30 -T 10
------------------------------------------------------------
Client connecting to 239.0.0.11, UDP port 5001
Sending 1470 byte datagrams
Setting multicast TTL to 10
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
root@e1# tcpdump -pni eth0
10:58:23.424860 IP 192.168.1.1.35277 > 239.0.0.11.5001: UDP, length 1470
root@c3# tcpdump -pni eth0
10:58:23.552903 IP 192.168.11.11 > 192.168.101.133: PIMv2, Register, length 1480
root@e2# tcpdump -pni eth0
10:58:23.896171 IP 192.168.1.1.35277 > 239.0.0.11.5001: UDP, length 1470
root@e3# tcpdump -pni eth0
10:58:23.824647 IP 192.168.1.1.35277 > 239.0.0.11.5001: UDP, length 1470

As presented here, the routing is not optimal: packets from R1 to R2 could avoid the RP. Moreover, encapsulating multicast packets into unicast packets is not efficient either. At some point, the RP will decide to switch to native multicast¹. Rooted at R1, the shortest-path tree (SPT) for the multicast group will be built using source-specific join messages (also known as a (S,G) join). Multicast routing without RP

From here, each router in the tree knows how to handle multicast packets from R1 to the group without involving the RP. For example, E1 knows it must duplicate the packet and sends one through the interface to C3 and the other one through the interface to C1:

root@e1$ show pim join   
Group           Source          RP              Flags
239.0.0.11      192.168.1.1     192.168.101.133 SG SPT DirectlyConnectedS 
    Upstream interface (S):    eth0
    Upstream interface (RP):   eth1
    Upstream MRIB next hop (RP): 192.168.11.111
    Upstream MRIB next hop (S):  UNKNOWN
    Upstream RPF'(S,G):        UNKNOWN
    Upstream state:            Joined 
    Register state:            RegisterPrune RegisterCouldRegister 
    Join timer:                7
    KAT(S,G) running:          true
    Local receiver include WC: ....
    Local receiver include SG: ....
    Local receiver exclude SG: ....
    Joins RP:                  ....
    Joins WC:                  ....
    Joins SG:                  .OO.
    Join state:                .OO.
    Prune state:               ....
    Prune pending state:       ....
    I am assert winner state:  ....
    I am assert loser state:   ....
    Assert winner WC:          ....
    Assert winner SG:          ....
    Assert lost WC:            ....
    Assert lost SG:            ....
    Assert lost SG_RPT:        ....
    Assert tracking SG:        OOO.
    Could assert WC:           ....
    Could assert SG:           .OO.
    I am DR:                   O..O
    Immediate olist RP:        ....
    Immediate olist WC:        ....
    Immediate olist SG:        .OO.
    Inherited olist SG:        .OO.
    Inherited olist SG_RPT:    ....
    PIM include WC:            ....
    PIM include SG:            ....
    PIM exclude SG:            ....
root@e1$ show pim mfc  
Group           Source          RP             
239.0.0.11      192.168.1.1     192.168.101.133
    Incoming interface :      eth0
    Outgoing interfaces:      .OO.
root@e1$ exit
[Connection to XORP closed]
root@e1# ip mroute
(192.168.1.1, 239.0.0.11)        Iif: eth0       Oifs: eth1 eth2

Setting up VXLAN Once IP multicast is running, setting up VXLAN is quite easy. Here are the software requirements:

A recent kernel. Pick at least 3.7-rc3. You need to enable CONFIG_VXLAN option. You also currently need a patch on top of it to be able to specify a TTL greater than 1 for multicast packets.
A recent version of ip. Currently, you need the version from git.

On R1, R2 and R3, we create a vxlan42 interface with the following commands:

root@rX# ./ip link add vxlan42 type vxlan id 42 \
>                               group 239.0.0.42 \
>                               ttl 10 dev eth0
root@rX# ip link set up dev vxlan42
root@rX# ./ip -d link show vxlan42
10: vxlan42: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UNKNOWN mode DEFAULT 
link/ether 3e:09:1c:e1:09:2e brd ff:ff:ff:ff:ff:ff
vxlan id 42 group 239.0.0.42 dev eth0 port 32768 61000 ttl 10 ageing 300

Let s assign an IP in 192.168.99.0/24 for each router and check they can ping each other:

root@r1# ip addr add 192.168.99.1/24 dev vxlan42
root@r2# ip addr add 192.168.99.2/24 dev vxlan42
root@r3# ip addr add 192.168.99.3/24 dev vxlan42
root@r1# ping 192.168.99.2                    
PING 192.168.99.2 (192.168.99.2) 56(84) bytes of data.
64 bytes from 192.168.99.2: icmp_req=1 ttl=64 time=3.90 ms
64 bytes from 192.168.99.2: icmp_req=2 ttl=64 time=1.38 ms
64 bytes from 192.168.99.2: icmp_req=3 ttl=64 time=1.82 ms
--- 192.168.99.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 1.389/2.375/3.907/1.098 ms

We can check the packets are encapsulated:

root@r1# tcpdump -pni eth0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
11:30:36.561185 IP 192.168.1.1.43349 > 192.168.2.2.8472: UDP, length 106
11:30:36.563179 IP 192.168.2.2.33894 > 192.168.1.1.8472: UDP, length 106
11:30:37.562677 IP 192.168.1.1.43349 > 192.168.2.2.8472: UDP, length 106
11:30:37.564316 IP 192.168.2.2.33894 > 192.168.1.1.8472: UDP, length 106

Moreover, if we send broadcast packets (with ping -b or ARP requests), they are encapsulated into multicast packets:

root@r1# tcpdump -pni eth0
11:31:27.464198 IP 192.168.1.1.41958 > 239.0.0.42.8472: UDP, length 106
11:31:28.463584 IP 192.168.1.1.41958 > 239.0.0.42.8472: UDP, length 106

Recent versions of iproute also comes with bridge, an utility allowing one to inspect the FDB of bridge-like interfaces:

root@r1# ../bridge/bridge fdb show vxlan42
3e:09:1c:e1:09:2e dev vxlan42 dst 192.168.2.2 self 
0e:98:40:c6:58:10 dev vxlan42 dst 192.168.3.3 self

Demo For a demo, have a look at the following video (it is also available as an Ogg Theora video).
<iframe frameborder="0" height="270" src="http://www.dailymotion.com/embed/video/xusell" width="480"></iframe>

The decision is usually done when the bandwidth used by the follow reachs some threshold. With XORP, this can be controlled with `switch-to-spt-threshold`. However, I was unable to make this works as expected. XORP never sends the appropriate PIM packets to make the switch. Therefore, for this lab, it has been configured to switch to native multicast at the first received packet.

20 October 2012

Vincent Bernat: Network lab with KVM

To experiment with network stuff, I was using UML-based network labs. Many alternatives exist, like GNS3, Netkit, Marionnet or Cloonix. All of them are great viable solutions but I still prefer to stick to my minimal home-made solution with UML virtual machines. Here is why:

I didn t want to use disk images. They take a lot of space and they have to be maintained. They also become cluttered, especially if you try to reuse them across several labs. They are also difficult to share.
I want to be able to access my home directory. It contains the important configuration files related to the lab and I can put them in the right place thanks to symbolic links when the lab starts. It also makes exchanging files with the lab quite easy.
I don t want to boot a complete system. This allows me to be cheap on memory and each virtual system should boot in a few seconds.

The use of UML had some drawbacks:

It may be buggy. For example, it is currently not possible to use gdbserver inside UML without a patch. Sometimes, the kernel won t even compile.
It is slow.

However, UML features HostFS, a filesystem providing access to any part of the host filesystem. This is the killer feature which allows me to not use any virtual disk image and to get access to my home directory right from the guest. I discovered recently that KVM provided 9P, a similar filesystem on top of VirtIO, the paravirtualized IO framework.

Setting up the lab The setup of the lab is done with a single self-contained shell file. The layout is similar to what I have done with UML. I will only highlight here the most interesting steps.

Booting KVM with a minimal kernel My initial goal was to experiment with Nicolas Dichtel s IPv6 ECMP patch. Therefore, I needed to configure a custom kernel. I have started from make defconfig, removed everything that was not necessary, added what I needed for my lab (mostly network stuff) and added the appropriate options for VirtIO drivers:

CONFIG_NET_9P_VIRTIO=y
CONFIG_VIRTIO_BLK=y
CONFIG_VIRTIO_NET=y
CONFIG_VIRTIO_CONSOLE=y
CONFIG_HW_RANDOM_VIRTIO=y
CONFIG_VIRTIO=y
CONFIG_VIRTIO_RING=y
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_BALLOON=y
CONFIG_VIRTIO_MMIO=y

No modules. Grab the complete configuration if you want to have a look. From here, you can start your kernel with the following command ($LINUX is the appropriate bzImage):

kvm \
  -m 256m \
  -display none \
  -nodefconfig -no-user-config -nodefaults \
  \
  -chardev stdio,id=charserial0,signal=off \
  -device isa-serial,chardev=charserial0,id=serial0 \
  \
  -chardev socket,id=con0,path=$TMP/vm-$name-console.pipe,server,nowait \
  -mon chardev=con0,mode=readline,default \
  \
  -kernel $LINUX \
  -append "init=/bin/sh console=ttyS0"

Of course, since there is no disk to boot from, the kernel will panic when trying to mount the root filesystem. KVM is configured to not display video output (-display none). A serial port is defined and uses stdio as a backend¹. The kernel is configured to use this serial port as a console (console=ttyS0). A VirtIO console could have been used instead but it seems this is not possible to make it work early in the boot process. The KVM monitor is setup to listen on an Unix socket. It is possible to connect to it with socat UNIX:$TMP/vm-$name-console.pipe -.

Initial ramdisk UPDATED: I was initially unable to mount the host filesystem as the root filesystem for the guest directly by the kernel. In a comment, Josh Triplett told me to use /dev/root as the mount tag to solve this problem. I keep using an initrd in this post but the lab on Github has been updated to not use one. Here is how to build a small initial ramdisk:

# Setup initrd
setup_initrd()  
    info "Build initrd"
    DESTDIR=$TMP/initrd
    mkdir -p $DESTDIR
    # Setup busybox
    copy_exec $($WHICH busybox) /bin/busybox
    for applet in $($ DESTDIR /bin/busybox --list); do
        ln -s busybox $ DESTDIR /bin/$ applet 
    done
    # Setup init
    cp $PROGNAME $ DESTDIR /init
    cd "$ DESTDIR " && find .   \
       cpio --quiet -R 0:0 -o -H newc   \
       gzip > $TMP/initrd.gz

The copy_exec function is stolen from the initramfs-tools package in Debian. It will ensure that the appropriate libraries are also copied. Another solution would have been to use a static busybox. The setup script is copied as /init in the initial ramdisk. It will detect it has been invoked as such. If it was omitted, a shell would be spawned instead. Remove the cp call if you want to experiment manually. The flag -initrd allows KVM to use this initial ramdisk.

Root filesystem Let s mount our root filesystem using 9P. This is quite easy. First KVM needs to be configured to export the host filesystem to the guest:

kvm \
  $ PREVIOUS_ARGS  \
  -fsdev local,security_model=passthrough,id=fsdev-root,path=$ ROOT ,readonly \
  -device virtio-9p-pci,id=fs-root,fsdev=fsdev-root,mount_tag=rootshare

$ ROOT can either be / or any directory containing a complete filesystem. Mounting it from the guest is quite easy:

mkdir -p /target/ro
mount -t 9p rootshare /target/ro -o trans=virtio,version=9p2000.u

You should find a complete root filesystem inside /target/ro. I have used version=9p2000.u instead of version=9p2000.L because the later does not allow a program to mount() a host mount point². Now, you have a read-only root filesystem (because you don t want to mess with your existing root filesystem and moreover, you did not run this lab as root, did you?). Let s use an union filesystem. Debian comes with AUFS while Ubuntu and OpenWRT have migrated to overlayfs. I was previously using AUFS but got errors on some specific cases. It is still not clear which one will end up in the kernel. So, let s try overlayfs. I didn t find any patchset ready to be applied on top of my kernel tree. I was working with David Miller s net-next tree. Here is how I have applied the overlayfs patch on top of it:

$ git remote add torvalds git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
$ git fetch torvalds
$ git remote add overlayfs git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git
$ git fetch overlayfs
$ git merge-base overlayfs.v15 v3.6
4cbe5a555fa58a79b6ecbb6c531b8bab0650778d
$ git checkout -b net-next+overlayfs
$ git cherry-pick 4cbe5a555fa58a79b6ecbb6c531b8bab0650778d..overlayfs.v15

Don t forget to enable CONFIG_OVERLAYFS_FS in .config. Here is how I configured the whole root filesystem:

info "Setup overlayfs"
mkdir /target
mkdir /target/ro
mkdir /target/rw
mkdir /target/overlay
# Version 9p2000.u allows to access /dev, /sys and mount new
# partitions over them. This is not the case for 9p2000.L.
mount -t 9p        rootshare /target/ro      -o trans=virtio,version=9p2000.u
mount -t tmpfs     tmpfs     /target/rw      -o rw
mount -t overlayfs overlayfs /target/overlay -o lowerdir=/target/ro,upperdir=/target/rw
mount -n -t proc  proc /target/overlay/proc
mount -n -t sysfs sys  /target/overlay/sys
info "Mount home directory on /root"
mount -t 9p homeshare /target/overlay/root -o trans=virtio,version=9p2000.L,access=0,rw
info "Mount lab directory on /lab"
mkdir /target/overlay/lab
mount -t 9p labshare /target/overlay/lab -o trans=virtio,version=9p2000.L,access=0,rw
info "Chroot"
export STATE=1
cp "$PROGNAME" /target/overlay
exec chroot /target/overlay "$PROGNAME"

You have to export your $ HOME and the lab directory from host:

kvm \
  $ PREVIOUS_ARGS  \
  -fsdev local,security_model=passthrough,id=fsdev-root,path=$ ROOT ,readonly \
  -device virtio-9p-pci,id=fs-root,fsdev=fsdev-root,mount_tag=rootshare \
  -fsdev local,security_model=none,id=fsdev-home,path=$ HOME  \
  -device virtio-9p-pci,id=fs-home,fsdev=fsdev-home,mount_tag=homeshare \
  -fsdev local,security_model=none,id=fsdev-lab,path=$(dirname "$PROGNAME") \
  -device virtio-9p-pci,id=fs-lab,fsdev=fsdev-lab,mount_tag=labshare

Network You know what is missing from our network lab? Network setup. For each LAN that I will need, I spawn a VDE switch:

# Setup a VDE switch
setup_switch()  
    info "Setup switch $1"
    screen -t "sw-$1" \
        start-stop-daemon --make-pidfile --pidfile "$TMP/switch-$1.pid" \
        --start --startas $($WHICH vde_switch) -- \
        --sock "$TMP/switch-$1.sock"
    screen -X select 0

To attach an interface to the newly created LAN, I use:

mac=$(echo $name-$net   sha1sum   \
            awk ' print "52:54:" substr($1,0,2) ":" substr($1, 2, 2) ":" substr($1, 4, 2) ":" substr($1, 6, 2) ')
kvm \
  $ PREVIOUS_ARGS  \
  -net nic,model=virtio,macaddr=$mac,vlan=$net \
  -net vde,sock=$TMP/switch-$net.sock,vlan=$net

The use of a VDE switch allows me to run the lab as a non-root user. It is possible to give Internet access to each VM, either by using -net user flag or using slirpvde on a special switch. I prefer the latest solution since it will allow the VM to speak to each others.

Debugging This lab was mostly done to debug both the kernel and Quagga. Each of them can be debugged remotely.

Kernel debugging While the kernel features KGDB, its own debugger, compatible with GDB, it is easier to use the remote GDB server built inside KVM.

kvm \
  $ PREVIOUS_ARGS  \
  -gdb unix:$TMP/vm-$name-gdb.pipe,server,nowait

To connect to the remote GDB server from the host, first locate the vmlinux file at the root of the source tree and run GDB on it. The kernel has to be compiled with CONFIG_DEBUG_INFO=y to get the appropriate debugging symbols. Then, use socat with the Unix socket to attach to the remote debugger:

$ gdb vmlinux
GNU gdb (GDB) 7.4.1-debian
Reading symbols from /home/bernat/src/linux/vmlinux...done.
(gdb) target remote   socat UNIX:$TMP/vm-$name-gdb.pipe -
Remote debugging using   socat UNIX:/tmp/tmp.W36qWnrCEj/vm-r1-gdb.pipe -
native_safe_halt () at /home/bernat/src/linux/arch/x86/include/asm/irqflags.h:50
50   
(gdb)

You can now set breakpoints and resume the execution of the kernel. It is easier to debug the kernel if optimizations are not enabled. However, it is not possible to disable them globally. You can however disable them for some files. For example, to debug net/ipv6/route.c, just add CFLAGS_route.o = -O0 to net/ipv6/Makefile, remove net/ipv6/route.o and type make.

Userland debugging To debug a program inside KVM, you can just use gdb as usual. Your $HOME directory is available and it should be therefore straightforward. However, if you want to perform some remote debugging, that s quite easy. Add a new serial port to KVM:

kvm \
  $ PREVIOUS_ARGS  \
  -chardev socket,id=charserial1,path=$TMP/vm-$name-serial.pipe,server,nowait \
  -device isa-serial,chardev=charserial1,id=serial1

Starts gdbserver in the guest:

$ libtool execute gdbserver /dev/ttyS1 zebra/zebra
Process /root/code/orange/quagga/build/zebra/.libs/lt-zebra created; pid = 800
Remote debugging using /dev/ttyS1

And from the host, you can attach to the remote process:

$ libtool execute gdb zebra/zebra
GNU gdb (GDB) 7.4.1-debian
Reading symbols from /home/bernat/code/orange/quagga/build/zebra/.libs/lt-zebra...done.
(gdb) target remote   socat UNIX:/tmp/tmp.W36qWnrCEj/vm-r1-serial.pipe
Remote debugging using   socat UNIX:/tmp/tmp.W36qWnrCEj/vm-r1-serial.pipe
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00007ffff7dddaf0 in ?? () from /lib64/ld-linux-x86-64.so.2
(gdb)

Demo For a demo, have a look at the following video (it is also available as an Ogg Theora video).
<iframe frameborder="0" height="270" src="http://www.dailymotion.com/embed/video/xuglsg" width="480"></iframe>

`stdio` is configured such that signals are not enabled. KVM won t stop when receiving `SIGINT`. This is important for the usage we want to have.

Therefore, it is not possible to mound a fresh `/proc` on top of the existing one. I have searched a bit but didn t find why. Any comments on this is welcome.

28 July 2012

Steve Kemp: I will be awesome, eventually.

Earlier this year, in March, I switched to the bluetile tiling window, and it has been a great success. So I was interested in seeing Vincent Bernats post about switching to Awesome, I myself intend to do that "eventually", now I've successfully dipped my toes into tiling-land, via bluetiles simple config and gnome-friendly setup. One thing that puts me off is the length of my sessions:

skx@precious:~/hg/blog$ ps -ef   grep skx
skx       2237     1  0 Mar12 ?        00:00:01 /usr/bin/gnome-keyring-daemon ...

As you can see I've been logged into my desktop session for four months. I lock the screen when I wander away, but generally login once when the computer boots and never again for half a year or so. FWIW:

skx@precious:~/hg/blog$ uptime
 23:01:19 up 138 days,  4:44,  4 users,  load average: 0.02, 0.06, 0.02

ObQuote: "I'm 30 years old. I'm almost a grown man. " - Rocketman

Vincent Bernat: Switching to the awesome window manager

I have happily used FVWM as my window manager for more than 10 years. However, I recently got tired of manually arranging windows and using the mouse so much. A window manager is one of the handful pieces of software getting in your way at every moment which explains why there are so many of them and why we might put so much time in it. I decided to try a tiling window manager. While i3 seemed pretty hot and powerful (watch the screencast!), I really wanted something configurable and extensible with some language. So far, the common choices are:

awesome, written in C, configurable and extensible in Lua,
StumpWM, written, configurable and extensible in Common Lisp,
xmonad, written, configurable and extensible in Haskell.

I chose awesome, despite the fact that StumpWM vote for Lisp seemed a better fit (but it is more minimalist). I hope there is some parallel universe where I enjoy StumpWM.

Visually, here is what I got so far:

Awesome configuration Without a configuration file, awesome does nothing. It does not come with any hard-coded behavior: everything needs to be configured through its Lua configuration file. Of course, a default one is provided but you can also start from scratch. If you like to control your window manager, this is somewhat wonderful. awesome is well documented. The wiki provides a FAQ, a good introduction and the API reference is concise enough to be read from the top to the bottom. Knowing Lua is not mandatory since it is quite easy to dive into such a language. I have posted my configuration on GitHub. It should not be used as is but some snippets may be worth to be stolen and adapted into your own configuration. The following sections put light on some notable points.

Keybindings Ten years ago was the epoch of scavanger hunts to recover IBM Model M keyboards from waste containers. They were great to type on and they did not feature the infamous Windows keys. Nowadays, this is harder to get such a keyboard. All my keyboards now have Windows keys. This is a major change with respect to configure a window manager: the left Windows key is mapped to `Mod4` and is usually unused by most applications and can therefore be dedicated to the window manager. The main problem with the ability to define many keybindings is to remember the less frequently used one. I have monkey-patched `awful.key` module to be able to attach a documentation string to a keybinding. I have documented the whole process on the awesome wiki.

Quake console A Quake console is a drop-down terminal which can be toggled with some key. I was heavily relying on it in FVWM. I think this is still a useful addition to any awesome configuration. There are several possible solutions documented in the awesome wiki. I have added my own¹ which works great for me.

XRandR XRandR is an extension which allows to dynamically reconfigure outputs: you plug an external screen to your laptop and you issue some command to enable it:

$ xrandr --output VGA-1 --auto --left-of LVDS-1

awesome detects the change and will restart automatically. Laptops usually come with a special key to enable/disable an external screen. Nowadays, this key does nothing unless configured appropriately. Out of the box, it is mapped to XF86Display symbol. I have associated this key to a function that will cycle through possible configurations depending on the plugged screens. For example, if I plug an external screen to my laptop, I can cycle through the following configurations:

only the internal screen,
only the external screen,
internal screen on the left, external screen on the right,
external screen on the left, internal screen on the right,
no change.

The proposed configuration is displayed using naughty, the notification system integrated in awesome. Notification of screen reconfiguration

Widgets I was previously using Conky to display various system-related information, like free space, CPU usage and network usage. awesome comes with widgets that can fit the same use. I am relying on vicious, a contributed widget manager, to manage most of them. It allows one to attach a function whose task is to fetch values to be displayed. This is quite powerful. Here is an example with a volume widget:

local volwidget = widget(  type = "textbox"  )
vicious.register(volwidget, vicious.widgets.volume,
         '<span font="Terminus 8">$2 $1%</span>',
        2, "Master")
volwidget:buttons(awful.util.table.join(
             awful.button(   , 1, volume.mixer),
             awful.button(   , 3, volume.toggle),
             awful.button(   , 4, volume.increase),
             awful.button(   , 5, volume.decrease)))

You can also use a function to format the text as you wish. For example, you can display a value in red if it is too low. Have a look at my battery widget for an example. Various widgets

Miscellaneous While I was working on my awesome configuration, I also changed some other desktop-related bits.

Keyboard configuration I happen to setup all my keyboards to use the QWERTY layout. I use a compose key to input special characters like . I have also recently use Caps Lock as a Control key. All this is perfectly supported since ages by X11 I am also mapping the Pause key to XF86ScreenSaver key symbol which will in turn be bound to a function that will trigger xautolock to lock the screen. Thanks to a great article about extending the X keyboard map with xkb, I discovered that X was able to switch from one layout to another using groups². I finally opted for this simple configuration:

$ setxkbmap us,fr '' compose:rwin ctrl:nocaps grp:rctrl_rshift_toggle
$ xmodmap -e 'keysym Pause = XF86ScreenSaver'

I switch from us to fr by pressing both left Control and left Shift keys.

Getting rid of most GNOME stuff Less than one year ago, to take a step forward to the future, I started to heavily rely on some GNOME components like GNOME Display Manager, GNOME Power Manager, the screen saver, `gnome-session`, `gnome-settings-daemon` and others. I had numerous problems when I tried to setup everything without pulling the whole GNOME stack. At each GNOME update, something was broken: the screensaver didn t start automatically anymore until a full session restart or some keybindings were randomly hijacked by `gnome-settings-daemon`. Therefore, I have decided to get rid of most of those components. I have replaced GNOME Power Manager with system-level tools like sleepd and the PM utilities. I replaced the GNOME screensaver with i3lock and xautolock. GDM has been replaced by SLiM which now features ConsoleKit support³. I use `~/.gtkrc-2.0` and `~/.config/gtk-3.0/settings.ini` to configure GTK+. The future will wait.

Terminal color scheme I am using rxvt-unicode as my terminal with a black background (and some light transparency). The default color scheme is suboptimal on the readability front. Sharing terminal color schemes seems a popular activity. I finally opted for the derp color scheme which brings a major improvement over the default configuration. Comparison of terminal color schemes

I have also switched to Xft for font rendering using DejaVu Sans Mono as my default font (instead of fixed) with the following configuration in ~/.Xresources:

Xft.antialias: true
Xft.hinting: true
Xft.hintstyle: hintlight
Xft.rgba: rgb
URxvt.font: xft:DejaVu Sans Mono-8
URxvt.letterSpace: -1

The result is less crisp but seems a bit more readable. I may switch back in the future. Comparison of terminal fonts

Next steps My reliance to the mouse has been greatly reduced. However, I still need it for casual browsing. I am looking at luakit a WebKit-based browser extensible with Lua for this purpose.

The console gets its own unique name. This allows awesome to reliably detect when it is spawned, even on restart. It is how the Quake console works in the mod of FVWM I was using.

However, the layout is global, not per-window. If you are interested by a per-window layout, take a look at kbdd.

Nowadays, you cannot really survive without ConsoleKit. Many PolicyKit policies do not rely on groups any more to grant access to your devices.

Next.

Previous.